Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

InternVideo

Reported on 93 benchmarks across 15 tasks · 2 papers · 73 SOTA

Note: results are matched by exact model name. Different papers may use the same name for different model variants.
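
As a rough illustration of what exact-name matching implies, the sketch below filters a few rows by model name. The record layout, the field names, and the helpers `results_for` and `papers_for` are hypothetical and are not the site's actual schema or code; case-sensitive string equality is also an assumption. The three values are copied from this page.

```python
# Hypothetical records: values come from this page, field names are illustrative only.
RESULTS = [
    {"model": "InternVideo", "benchmark": "Kinetics-400", "metric": "Acc@1",
     "value": 91.1, "paper": "arXiv:2212.03191"},
    {"model": "InternVideo", "benchmark": "Ego4D", "metric": "AP",
     "value": 37.19, "paper": "arXiv:2211.09529"},
    # Paper id left as None because this page does not list it.
    {"model": "InternVideo2-6B", "benchmark": "VATEX", "metric": "video-to-text R@1",
     "value": 89.3, "paper": None},
]


def results_for(model_name, results):
    """Return rows whose model name equals model_name exactly.

    Assumption: the comparison is case-sensitive and does no prefix matching.
    """
    return [r for r in results if r["model"] == model_name]


def papers_for(model_name, results):
    """Return the sorted set of paper ids that report results under model_name."""
    return sorted({r["paper"] for r in results_for(model_name, results) if r["paper"]})


matched = results_for("InternVideo", RESULTS)
papers = papers_for("InternVideo", RESULTS)
print(len(matched), papers)
```

Under these assumptions, rows from both papers are grouped under "InternVideo" while the "InternVideo2-6B" row is excluded, which is why a single model page can mix variants that share a name.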

Computer Vision · 67 results

Video on HACS
Average-mAP · 2022-12-06
41.55
best: 45.8 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on FineAction
mAP · 2022-12-06
17.57
best: 29.6 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on VATEX
text-to-video R@1 · 2022-12-06
71.1
best: 87.7 (GRAM)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on VATEX
video-to-text R@1 · 2022-12-06
87.2
best: 89.3 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on ActivityNet
text-to-video R@1 · uses extra data · 2022-12-06
62.2
best: 74.1 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on ActivityNet
video-to-text R@1 · uses extra data · 2022-12-06
62.8
best: 69.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on DiDeMo
text-to-video R@1 · uses extra data · 2022-12-06
57.9
best: 74.2 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on DiDeMo
video-to-text R@1 · uses extra data · 2022-12-06
59.1
best: 71.9 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on MSR-VTT
text-to-video R@1 · uses extra data · 2022-12-06
55.2
best: 64 (GRAM)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on LSMDC
video-to-text R@1 · uses extra data · 2022-12-06
34.9
best: 46.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on MSVD
video-to-text R@1 · uses extra data · 2022-12-06
76.3
best: 85.2 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on Kinetics-400
Acc@1 · 2022-12-06
91.1
best: 93.6 (OmniVec2)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Temporal Action Localization on HACS
Average-mAP · 2022-12-06
41.55
best: 45.8 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Temporal Action Localization on FineAction
mAP · 2022-12-06
17.57
best: 29.6 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Localization on HACS
Average-mAP · 2022-12-06
41.55
best: 45.8 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Localization on FineAction
mAP · 2022-12-06
17.57
best: 29.6 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Localization on AVA-Kinetics
val mAP · uses extra data · 2022-12-06
41.01
best: 42.6 (VideoMAE V2-g)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on VATEX
text-to-video R@1 · 2022-12-06
71.1
best: 87.7 (GRAM)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on VATEX
video-to-text R@1 · 2022-12-06
87.2
best: 89.3 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on ActivityNet
text-to-video R@1 · uses extra data · 2022-12-06
62.2
best: 74.1 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on ActivityNet
video-to-text R@1 · uses extra data · 2022-12-06
62.8
best: 69.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on DiDeMo
text-to-video R@1 · uses extra data · 2022-12-06
57.9
best: 74.2 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on DiDeMo
video-to-text R@1 · uses extra data · 2022-12-06
59.1
best: 71.9 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on MSR-VTT
text-to-video R@1 · uses extra data · 2022-12-06
55.2
best: 64 (GRAM)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on LSMDC
video-to-text R@1 · uses extra data · 2022-12-06
34.9
best: 46.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on MSVD
video-to-text R@1 · uses extra data · 2022-12-06
76.3
best: 85.2 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on VATEX
text-to-video R@1 · 2022-12-06
49.5
best: 83.9 (GRAM)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on VATEX
video-to-text R@1 · 2022-12-06
69.5
best: 85.4 (InternVideo2-1B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on MSR-VTT
text-to-video R@1 · uses extra data · 2022-12-06
40.7
best: 55.9 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on MSR-VTT
video-to-text R@1 · uses extra data · 2022-12-06
39.6
best: 53.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on MSVD
video-to-text R@1 · uses extra data · 2022-12-06
67.6
best: 83.3 (InternVideo2-1B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@1 · uses extra data · 2022-12-06
33.5
best: 57.1 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@10 · uses extra data · 2022-12-06
71.1
best: 85 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
video-to-text R@5 · uses extra data · 2022-12-06
60.3
best: 79.9 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
text-to-video R@1 · uses extra data · 2022-12-06
17.6
best: 33.8 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
text-to-video R@10 · uses extra data · 2022-12-06
40.2
best: 62.2 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
video-to-text R@1 · uses extra data · 2022-12-06
13.2
best: 30.1 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
video-to-text R@10 · uses extra data · 2022-12-06
34.9
best: 54.8 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
video-to-text R@5 · uses extra data · 2022-12-06
27.8
best: 47.7 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on ActivityNet
video-to-text R@1 · uses extra data · 2022-12-06
31.4
best: 56.5 (InternVideo2-6B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
State Change Object Detection on Ego4D
AP · 2022-11-17
37.19
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
State Change Object Detection on Ego4D
AP50 · 2022-11-17
55.97
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
State Change Object Detection on Ego4D
AP75 · 2022-11-17
38.44
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Short-term Object Interaction Anticipation on Ego4D
Noun (Top5 mAP) · 2022-11-17
24.6
best: 34.886 (SOIA-DOD)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Short-term Object Interaction Anticipation on Ego4D
Noun+TTC (Top5 mAP) · 2022-11-17
7.64
best: 12.41 (EgoVideo)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Short-term Object Interaction Anticipation on Ego4D
Noun+Verb(Top5 mAP) · 2022-11-17
9.18
best: 17.614 (SOIA-DOD)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Short-term Object Interaction Anticipation on Ego4D
Overall (Top5 mAP) · 2022-11-17
3.4
best: 7.21 (EgoVideo)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Future Hand Prediction on Ego4D
C.Disp(Left) · 2022-11-17
53.33
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Future Hand Prediction on Ego4D
C.Disp(Right) · 2022-11-17
53.37
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Future Hand Prediction on Ego4D
Disp(Total) · 2022-11-17
196.8
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Future Hand Prediction on Ego4D
M.Disp(Left) · 2022-11-17
43.25
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Future Hand Prediction on Ego4D
M.Disp(Right) · 2022-11-17
46.25
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Video on ActivityNet-1.3
mAP · uses extra data · 2022-12-06
39
best: 42.9 (RDFA-S6 (InternVideo2-6B))
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on MSR-VTT
video-to-text R@1 · uses extra data · 2022-12-06
57.9
best: 64.8 (GRAM)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on LSMDC
text-to-video R@1 · uses extra data · 2022-12-06
34
best: 46.4 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video on MSVD
text-to-video R@1 · uses extra data · 2022-12-06
58.4
best: 61.4 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Temporal Action Localization on ActivityNet-1.3
mAP · uses extra data · 2022-12-06
39
best: 42.9 (RDFA-S6 (InternVideo2-6B))
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Localization on ActivityNet-1.3
mAP · uses extra data · 2022-12-06
39
best: 42.9 (RDFA-S6 (InternVideo2-6B))
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on MSR-VTT
video-to-text R@1 · uses extra data · 2022-12-06
57.9
best: 64.8 (GRAM)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on LSMDC
text-to-video R@1 · uses extra data · 2022-12-06
34
best: 46.4 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Retrieval on MSVD
text-to-video R@1 · uses extra data · 2022-12-06
58.4
best: 61.4 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on MSVD
text-to-video R@1 · uses extra data · 2022-12-06
43.4
best: 59.3 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@1 · uses extra data · 2022-12-06
31.5
best: 57.9 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@10 · uses extra data · 2022-12-06
68.2
best: 85.1 (InternVideo2-1B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on DiDeMo
text-to-video R@5 · uses extra data · 2022-12-06
57.6
best: 80 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on LSMDC
text-to-video R@5 · uses extra data · 2022-12-06
32.4
best: 55.9 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Video Retrieval on ActivityNet
text-to-video R@1 · uses extra data · 2022-12-06
30.7
best: 63.2 (InternVideo2-6B)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191

Natural Language Processing · 10 results

Question Answering on EgoSchema (fullset)
Accuracy · 2022-12-06
32.1
best: 71.14 (BIMBA-LLaVA-Qwen2-7B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Visual Question Answering (VQA) on TGIF-QA
Accuracy · 2022-12-06
0.722
best: 0.732 (HiTeA)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Natural Language Queries on Ego4D
R@1 IoU=0.3 · 2022-11-17
16.45
best: 28.05 (EgoVideo)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Natural Language Queries on Ego4D
R@1 IoU=0.5 · 2022-11-17
10.06
best: 19.31 (EgoVideo)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Natural Language Queries on Ego4D
R@1 Mean(0.3 and 0.5) · 2022-11-17
13.26
best: 23.68 (EgoVideo)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Natural Language Queries on Ego4D
R@5 IoU=0.3 · 2022-11-17
22.95
best: 45.63 (DeCafNet-100%)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Natural Language Queries on Ego4D
R@5 IoU=0.5 · 2022-11-17
16.1
best: 33.93 (DeCafNet-100%)
SOTA
InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges arXiv:2211.09529
Question Answering on STAR Benchmark
Accuracy · 2022-12-06
41.6
best: 59 (VideoChat2)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Visual Question Answering (VQA) on MSRVTT-QA
Accuracy · uses extra data · 2022-12-06
0.471
best: 0.496 (VLAB)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Visual Question Answering (VQA) on MSVD-QA
Accuracy · uses extra data · 2022-12-06
0.555
best: 0.61 (VLAB)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191

Robots · 5 results

Activity Recognition on Something-Something V1
Top 1 Accuracy · uses extra data · 2022-12-06
70
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Activity Recognition on Something-Something V2
Top-1 Accuracy · uses extra data · 2022-12-06
77.2
best: 77.3 (MVD (Kinetics400 pretrain, ViT-H, 16 frame))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Activity Recognition on AVA v2.2
mAP · uses extra data · 2022-12-06
41.01
best: 45.1 (LART (Hiera-H, K700 PT+FT))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Activity Recognition on UCF101-MiTv2
AUROC · 2022-12-06
91.85
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Activity Recognition on UCF-HMDB
AUROC · 2022-12-06
85.48
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191

Time Series · 5 results

Action Recognition on Something-Something V1
Top 1 Accuracy · uses extra data · 2022-12-06
70
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Recognition on Something-Something V2
Top-1 Accuracy · uses extra data · 2022-12-06
77.2
best: 77.3 (MVD (Kinetics400 pretrain, ViT-H, 16 frame))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Recognition on AVA v2.2
mAP · uses extra data · 2022-12-06
41.01
best: 45.1 (LART (Hiera-H, K700 PT+FT))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Recognition on UCF101-MiTv2
AUROC · 2022-12-06
91.85
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Action Recognition on UCF-HMDB
AUROC · 2022-12-06
85.48
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191

Methodology · 3 results

Zero-Shot Learning on HACS
Average-mAP · 2022-12-06
41.55
best: 45.8 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Learning on FineAction
mAP · 2022-12-06
17.57
best: 29.6 (RDFA-S6 (InternVideo2-6B))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Zero-Shot Learning on ActivityNet-1.3
mAP · uses extra data · 2022-12-06
39
best: 42.9 (RDFA-S6 (InternVideo2-6B))
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191

Reasoning · 3 results

Video Question Answering on STAR Benchmark
Average Accuracy · 2022-12-06
58.7
best: 67.1 (VLAP (4 frames))
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Question Answering on EgoSchema (fullset)
Accuracy · 2022-12-06
32.1
best: 71.14 (BIMBA-LLaVA-Qwen2-7B)
SOTA
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191
Video Question Answering on STAR Benchmark
Accuracy · 2022-12-06
41.6
best: 59 (VideoChat2)
InternVideo: General Video Foundation Models via Generative and Discriminative Learning arXiv:2212.03191