Aurora (ours, r=64)

Reported on 18 benchmarks across 4 tasks

Note: results are matched by exact model name. Different papers may use the same name for different model variants.

Computer Vision (16 results)

Video on DiDeMo
text-to-video Median Rank
1
best: 8.3 (Collaborative Experts)
Video on DiDeMo
text-to-video R@10
85.3
best: 94.2 (vid-TLDR (UMT-L))
Video on DiDeMo
text-to-video R@5
77.4
best: 91.2 (vid-TLDR (UMT-L))
Video on DiDeMo
text-to-video R@1
53.1
Video on MSR-VTT
text-to-video R@1
52.4
best: 64 (GRAM)
Video on MSR-VTT
text-to-video R@10
82
best: 89.6 (VAST)
Video on MSR-VTT
text-to-video R@5
73.9
best: 84.3 (VAST)
Video on MSR-VTT
text-to-video Median Rank
1
Video Retrieval on DiDeMo
text-to-video Median Rank
1
best: 8.3 (Collaborative Experts)
Video Retrieval on DiDeMo
text-to-video R@10
85.3
best: 94.2 (vid-TLDR (UMT-L))
Video Retrieval on DiDeMo
text-to-video R@5
77.4
best: 91.2 (vid-TLDR (UMT-L))
Video Retrieval on DiDeMo
text-to-video R@1
53.1
Video Retrieval on MSR-VTT
text-to-video R@1
52.4
best: 64 (GRAM)
Video Retrieval on MSR-VTT
text-to-video R@10
82
best: 89.6 (VAST)
Video Retrieval on MSR-VTT
text-to-video R@5
73.9
best: 84.3 (VAST)
Video Retrieval on MSR-VTT
text-to-video Median Rank
1
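
For readers unfamiliar with the retrieval metrics listed above, the following is a minimal sketch of how text-to-video R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It is not the evaluation code behind these numbers. The function name `retrieval_metrics`, the assumption that the correct video for query i sits at index i, and the optimistic handling of ties are all illustrative choices; individual benchmarks may differ.

```python
import numpy as np


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Median Rank for text-to-video retrieval.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Similarity of each query to its own ground-truth video, as a column.
    correct = np.diag(sim)[:, None]
    # Rank = 1 + number of videos scored strictly higher than the correct one
    # (ties are resolved in favour of the correct video; an assumption).
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics
```

Higher is better for R@K, while lower is better for Median Rank, with 1 being the lowest value this computation can produce.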

Visual Question Answering (VQA) on VQA v2 test-dev
Accuracy
77.69
best: 84.3 (PaLI)
Visual Question Answering on VQA v2 test-dev
Accuracy
77.69
best: 82.3 (BLIP-2 ViT-G OPT 6.7B (fine-tuned))