Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

UMT-L (ViT-L/16)

Reported on 103 benchmarks across 7 tasks · 1 paper · 67 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
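
A minimal sketch of what exact-name matching implies, assuming results are stored as (model name, benchmark, metric, value) rows. The row layout and the `group_by_exact_name` helper are hypothetical and not taken from the site; the values are copied from this page. Because the key is the raw string, a derived variant such as vid-TLDR (UMT-L) is kept separate, while two papers that reuse one name would be merged.

```python
from collections import defaultdict

# Hypothetical row layout: (model name, benchmark, metric, value).
# The values are the ones listed on this page.
rows = [
    ("UMT-L (ViT-L/16)", "DiDeMo", "text-to-video R@10", 93.5),
    ("vid-TLDR (UMT-L)", "DiDeMo", "text-to-video R@10", 94.2),
    ("UMT-L (ViT-L/16)", "MSR-VTT", "text-to-video R@1", 58.8),
]


def group_by_exact_name(rows):
    """Group rows by the raw model-name string, with no normalization."""
    grouped = defaultdict(list)
    for name, benchmark, metric, value in rows:
        grouped[name].append((benchmark, metric, value))
    return dict(grouped)


grouped = group_by_exact_name(rows)
for name, results in grouped.items():
    print(name, len(results))
```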

Computer Vision · 98 results

Video on SSv2-template retrieval
text-to-video R@1 · uses extra data · 2023-03-28
90.8
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
text-to-video R@1 · uses extra data · 2023-03-28
66.8
best: 74.1 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
text-to-video R@10 · uses extra data · 2023-03-28
94.9
best: 96.1 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
text-to-video R@5 · uses extra data · 2023-03-28
89.1
best: 90.9 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
video-to-text R@1 · uses extra data · 2023-03-28
64.4
best: 69.7 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
video-to-text R@10 · uses extra data · 2023-03-28
94.8
best: 95.4 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on ActivityNet
video-to-text R@5 · uses extra data · 2023-03-28
89.1
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on SSv2-label retrieval
text-to-video R@1 · uses extra data · 2023-03-28
73.3
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on SSv2-label retrieval
text-to-video R@10 · uses extra data · 2023-03-28
96.6
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on SSv2-label retrieval
text-to-video R@5 · uses extra data · 2023-03-28
92.7
best: 93.3 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
text-to-video R@1 · uses extra data · 2023-03-28
70.4
best: 74.2 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
text-to-video R@10 · uses extra data · 2023-03-28
93.5
best: 94.2 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
text-to-video R@5 · uses extra data · 2023-03-28
90.1
best: 91.2 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
video-to-text R@1 · uses extra data · 2023-03-28
65.7
best: 71.9 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
video-to-text R@10 · uses extra data · 2023-03-28
93.3
best: 93.8 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on DiDeMo
video-to-text R@5 · uses extra data · 2023-03-28
89.6
best: 89.8 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
text-to-video R@1 · uses extra data · 2023-03-28
58.8
best: 64 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
text-to-video R@10 · uses extra data · 2023-03-28
87.1
best: 89.6 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
text-to-video R@5 · uses extra data · 2023-03-28
81
best: 84.3 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
text-to-video R@1 · uses extra data · 2023-03-28
43
best: 46.4 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
video-to-text R@1 · uses extra data · 2023-03-28
41.4
best: 46.7 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-700
Top-5 Accuracy · uses extra data · 2023-03-28
96.7
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MiT
Top 1 Accuracy · uses extra data · 2023-03-28
48.7
best: 53.1 (OmniVec2)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MiT
Top 5 Accuracy · uses extra data · 2023-03-28
78.2
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-template retrieval
text-to-video R@1 · uses extra data · 2023-03-28
90.8
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
text-to-video R@1 · uses extra data · 2023-03-28
66.8
best: 74.1 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
text-to-video R@10 · uses extra data · 2023-03-28
94.9
best: 96.1 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
text-to-video R@5 · uses extra data · 2023-03-28
89.1
best: 90.9 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
video-to-text R@1 · uses extra data · 2023-03-28
64.4
best: 69.7 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
video-to-text R@10 · uses extra data · 2023-03-28
94.8
best: 95.4 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on ActivityNet
video-to-text R@5 · uses extra data · 2023-03-28
89.1
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-label retrieval
text-to-video R@1 · uses extra data · 2023-03-28
73.3
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-label retrieval
text-to-video R@10 · uses extra data · 2023-03-28
96.6
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-label retrieval
text-to-video R@5 · uses extra data · 2023-03-28
92.7
best: 93.3 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
text-to-video R@1 · uses extra data · 2023-03-28
70.4
best: 74.2 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
text-to-video R@10 · uses extra data · 2023-03-28
93.5
best: 94.2 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
text-to-video R@5 · uses extra data · 2023-03-28
90.1
best: 91.2 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
video-to-text R@1 · uses extra data · 2023-03-28
65.7
best: 71.9 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
video-to-text R@10 · uses extra data · 2023-03-28
93.3
best: 93.8 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on DiDeMo
video-to-text R@5 · uses extra data · 2023-03-28
89.6
best: 89.8 (vid-TLDR (UMT-L))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
text-to-video R@1 · uses extra data · 2023-03-28
58.8
best: 64 (GRAM)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
text-to-video R@10 · uses extra data · 2023-03-28
87.1
best: 89.6 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
text-to-video R@5 · uses extra data · 2023-03-28
81
best: 84.3 (VAST)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
text-to-video R@1 · uses extra data · 2023-03-28
43
best: 46.4 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
video-to-text R@1 · uses extra data · 2023-03-28
41.4
best: 46.7 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
video-to-text R@10 · uses extra data · 2023-03-28
69.6
best: 84.1 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
video-to-text R@5 · uses extra data · 2023-03-28
59.8
best: 77.5 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
text-to-video R@1 · uses extra data · 2023-03-28
49
best: 59.3 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
text-to-video R@5 · uses extra data · 2023-03-28
76.9
best: 84.4 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
video-to-text R@1 · uses extra data · 2023-03-28
74.5
best: 83.3 (InternVideo2-1B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
video-to-text R@10 · uses extra data · 2023-03-28
92.8
best: 97.9 (LanguageBind(ViT-L/14))
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
video-to-text R@5 · uses extra data · 2023-03-28
89.7
best: 94.3 (InternVideo2-1B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@1 · uses extra data · 2023-03-28
48.6
best: 57.9 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@5 · uses extra data · 2023-03-28
72.9
best: 80 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@1 · uses extra data · 2023-03-28
49.9
best: 57.1 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@10 · uses extra data · 2023-03-28
81.4
best: 85 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@5 · uses extra data · 2023-03-28
74.8
best: 79.9 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
text-to-video R@1 · uses extra data · 2023-03-28
25.2
best: 33.8 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
video-to-text R@1 · uses extra data · 2023-03-28
23.2
best: 30.1 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
video-to-text R@10 · uses extra data · 2023-03-28
44.2
best: 54.8 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
video-to-text R@5 · uses extra data · 2023-03-28
37.7
best: 47.7 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@1 · uses extra data · 2023-03-28
42.8
best: 63.2 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@10 · uses extra data · 2023-03-28
79.8
best: 92.5 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@5 · uses extra data · 2023-03-28
69.6
best: 85.6 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@1 · uses extra data · 2023-03-28
40.7
best: 56.5 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@10 · uses extra data · 2023-03-28
78.6
best: 90.3 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@5 · uses extra data · 2023-03-28
67.6
best: 82.8 (InternVideo2-6B)
SOTA
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on SSv2-template retrieval
text-to-video R@10 · uses extra data · 2023-03-28
100
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on SSv2-template retrieval
text-to-video R@5 · uses extra data · 2023-03-28
100
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
video-to-text R@1 · uses extra data · 2023-03-28
58.6
best: 64.8 (GRAM)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
video-to-text R@10 · uses extra data · 2023-03-28
86.5
best: 92.8 (CAMoE)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on MSR-VTT
video-to-text R@5 · uses extra data · 2023-03-28
81.6
best: 86.2 (CAMoE)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
text-to-video R@10 · uses extra data · 2023-03-28
73
best: 92.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
text-to-video R@5 · uses extra data · 2023-03-28
65.5
best: 80.1 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
video-to-text R@10 · uses extra data · 2023-03-28
71.5
best: 91.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on LSMDC
video-to-text R@5 · uses extra data · 2023-03-28
64.3
best: 71.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-700
Top-1 Accuracy · uses extra data · 2023-03-28
83.6
best: 85.9 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-400
Acc@1 · 2023-03-28
90.6
best: 93.6 (OmniVec2)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-400
Acc@5 · 2023-03-28
98.7
best: 98.9 (TubeViT-H (ImageNet-1k))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-600
Top-1 Accuracy · uses extra data · 2023-03-28
90.5
best: 91.9 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video on Kinetics-600
Top-5 Accuracy · uses extra data · 2023-03-28
98.8
best: 98.9 (TubeVit-H)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-template retrieval
text-to-video R@10 · uses extra data · 2023-03-28
100
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on SSv2-template retrieval
text-to-video R@5 · uses extra data · 2023-03-28
100
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
video-to-text R@1 · uses extra data · 2023-03-28
58.6
best: 64.8 (GRAM)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
video-to-text R@10 · uses extra data · 2023-03-28
86.5
best: 92.8 (CAMoE)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on MSR-VTT
video-to-text R@5 · uses extra data · 2023-03-28
81.6
best: 86.2 (CAMoE)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
text-to-video R@10 · uses extra data · 2023-03-28
73
best: 92.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
text-to-video R@5 · uses extra data · 2023-03-28
65.5
best: 80.1 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
video-to-text R@10 · uses extra data · 2023-03-28
71.5
best: 91.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Video Retrieval on LSMDC
video-to-text R@5 · uses extra data · 2023-03-28
64.3
best: 71.8 (HunYuan_tvr (huge))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@1 · uses extra data · 2023-03-28
42.6
best: 55.9 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@10 · uses extra data · 2023-03-28
73.1
best: 85.1 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@5 · uses extra data · 2023-03-28
64.4
best: 78.3 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSR-VTT
video-to-text R@1 · uses extra data · 2023-03-28
38.6
best: 53.7 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on MSVD
text-to-video R@10 · uses extra data · 2023-03-28
84.7
best: 89.6 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@10 · uses extra data · 2023-03-28
79
best: 85.1 (InternVideo2-1B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
text-to-video R@10 · uses extra data · 2023-03-28
50.5
best: 62.2 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Zero-Shot Video Retrieval on LSMDC
text-to-video R@5 · uses extra data · 2023-03-28
43
best: 55.9 (InternVideo2-6B)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
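
The R@1, R@5, and R@10 figures in the retrieval entries above are recall-at-K scores. As a rough illustration of how such a score is commonly computed, the sketch below assumes a square similarity matrix in which row i is a query (a caption for text-to-video, a clip for video-to-text) and candidate i is its single correct match. The `recall_at_k` function and the toy matrix are hypothetical, and the evaluation protocol used in the listed paper may differ, for example when a video has several captions.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j and that
    candidate i is the only correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct candidate = number of candidates scoring higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores, for illustration only).
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy, k):.1f}")
```

Recall can only stay the same or grow as K increases, which is why each benchmark's R@10 is never below its R@5 or R@1.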

Natural Language Processing · 2 results

Visual Question Answering (VQA) on MSRVTT-QA
Accuracy · uses extra data · 2023-03-28
0.471
best: 0.496 (VLAB)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058
Visual Question Answering (VQA) on MSVD-QA
Accuracy · uses extra data · 2023-03-28
0.552
best: 0.61 (VLAB)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058

Reasoning · 1 result

Video Question Answering on ActivityNet-QA
Accuracy · uses extra data · 2023-03-28
47.9
best: 61.6 (Tarsier (34B))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058

Robots · 1 result

Activity Recognition on AVA v2.2
mAP · uses extra data · 2023-03-28
39.8
best: 45.1 (LART (Hiera-H, K700 PT+FT))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058

Time Series · 1 result

Action Recognition on AVA v2.2
mAP · uses extra data · 2023-03-28
39.8
best: 45.1 (LART (Hiera-H, K700 PT+FT))
Unmasked Teacher: Towards Training-Efficient Video Foundation Models arXiv:2303.16058