Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

2024-03-22

Tasks: Zero-Shot Video Question Answer, Text to Audio Retrieval, Video Retrieval, Action Classification, Audio Classification, Video Grounding, Zero-Shot Video Retrieval, Video Recognition, Video Question Answering, Contrastive Learning, Moment Retrieval, Video Understanding, Action Recognition, Temporal Action Localization, Video Instance Segmentation, Zero-shot Text to Audio Retrieval

Abstract

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

Results

Task | Dataset | Metric | Value | Model
Video | HACS | Average-mAP | 43.3 | InternVideo2-6B
Video | HACS | Average-mAP | 42.4 | InternVideo2-1B
Video | ActivityNet-1.3 | mAP | 41.2 | InternVideo2-6B
Video | ActivityNet-1.3 | mAP | 40.4 | InternVideo2-1B
Video | FineAction | mAP | 27.7 | InternVideo2-6B
Video | THUMOS’14 | Avg mAP (0.3:0.7) | 72 | InternVideo2-6B
Video | THUMOS’14 | Avg mAP (0.3:0.7) | 69.8 | InternVideo2-1B
Video | VATEX | text-to-video R@1 | 75.5 | InternVideo2-6B
Video | VATEX | video-to-text R@1 | 89.3 | InternVideo2-6B
Video | ActivityNet | text-to-video R@1 | 74.1 | InternVideo2-6B
Video | ActivityNet | video-to-text R@1 | 69.7 | InternVideo2-6B
Video | DiDeMo | text-to-video R@1 | 74.2 | InternVideo2-6B
Video | DiDeMo | video-to-text R@1 | 71.9 | InternVideo2-6B
Video | MSR-VTT | text-to-video R@1 | 62.8 | InternVideo2-6B
Video | MSR-VTT | video-to-text R@1 | 60.2 | InternVideo2-6B
Video | LSMDC | text-to-video R@1 | 46.4 | InternVideo2-6B
Video | LSMDC | video-to-text R@1 | 46.7 | InternVideo2-6B
Video | MSVD | text-to-video R@1 | 61.4 | InternVideo2-6B
Video | MSVD | video-to-text R@1 | 85.2 | InternVideo2-6B
Video | QVHighlights | R@1,IoU=0.5 | 71.42 | InternVideo2-6B
Video | QVHighlights | R@1,IoU=0.7 | 56.45 | InternVideo2-6B
Video | QVHighlights | R@1,IoU=0.5 | 70 | InternVideo2-1B
Video | QVHighlights | R@1,IoU=0.7 | 54.45 | InternVideo2-1B
Video | Kinetics-700 | Top-1 Accuracy | 85.9 | InternVideo2-6B
Video | Kinetics-700 | Top-1 Accuracy | 85.4 | InternVideo2-1B
Video | MiT | Top 1 Accuracy | 50.9 | InternVideo2-1B
Video | Kinetics-400 | Acc@1 | 92.1 | InternVideo2-6B
Video | Kinetics-400 | Acc@1 | 91.6 | InternVideo2-1B
Video | Kinetics-600 | Top-1 Accuracy | 91.9 | InternVideo2-6B
Video | Kinetics-600 | Top-1 Accuracy | 91.6 | InternVideo2-1B
Video | MIT | Top 1 Accuracy | 51.2 | InternVideo2-6B
Temporal Action Localization | HACS | Average-mAP | 43.3 | InternVideo2-6B
Temporal Action Localization | HACS | Average-mAP | 42.4 | InternVideo2-1B
Temporal Action Localization | ActivityNet-1.3 | mAP | 41.2 | InternVideo2-6B
Temporal Action Localization | ActivityNet-1.3 | mAP | 40.4 | InternVideo2-1B
Temporal Action Localization | FineAction | mAP | 27.7 | InternVideo2-6B
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 72 | InternVideo2-6B
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 69.8 | InternVideo2-1B
Zero-Shot Learning | HACS | Average-mAP | 43.3 | InternVideo2-6B
Zero-Shot Learning | HACS | Average-mAP | 42.4 | InternVideo2-1B
Zero-Shot Learning | ActivityNet-1.3 | mAP | 41.2 | InternVideo2-6B
Zero-Shot Learning | ActivityNet-1.3 | mAP | 40.4 | InternVideo2-1B
Zero-Shot Learning | FineAction | mAP | 27.7 | InternVideo2-6B
Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 72 | InternVideo2-6B
Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 69.8 | InternVideo2-1B
Question Answering | MVBench | Accuracy | 60.9 | InternVideo2-1B
Question Answering | EgoSchema (fullset) | Accuracy | 60.2 | InternVideo2-6B
Video Question Answering | Perception Test | Accuracy (Top-1) | 63.4 | InternVideo2 (8B)
Video Question Answering | MVBench | Avg. | 67.2 | InternVideo2
Video Question Answering | MVBench | Accuracy | 60.9 | InternVideo2-1B
Video Question Answering | EgoSchema (fullset) | Accuracy | 60.2 | InternVideo2-6B
Activity Recognition | HACS | Top 1 Accuracy | 97 | InternVideo2-6B
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 77.1 | InternVideo2-1B
Activity Recognition | Something-Something V2 | GFLOPs | 13321 | InternVideo2-6B
Activity Recognition | Something-Something V2 | Parameters | 2131 | InternVideo2-6B
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 1 | InternVideo2-6B
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 12 | InternVideo2-6B
Activity Recognition | ActivityNet | mAP | 95.9 | InternVideo2-6B
Action Localization | HACS | Average-mAP | 43.3 | InternVideo2-6B
Action Localization | HACS | Average-mAP | 42.4 | InternVideo2-1B
Action Localization | ActivityNet-1.3 | mAP | 41.2 | InternVideo2-6B
Action Localization | ActivityNet-1.3 | mAP | 40.4 | InternVideo2-1B
Action Localization | FineAction | mAP | 27.7 | InternVideo2-6B
Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 72 | InternVideo2-6B
Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 69.8 | InternVideo2-1B
Audio Classification | ESC-50 | Accuracy (5-fold) | 98.6 | InternVideo2
Audio Classification | ESC-50 | Top-1 Accuracy | 98.6 | InternVideo2
Action Recognition | HACS | Top 1 Accuracy | 97 | InternVideo2-6B
Action Recognition | Something-Something V2 | Top-1 Accuracy | 77.1 | InternVideo2-1B
Action Recognition | Something-Something V2 | GFLOPs | 13321 | InternVideo2-6B
Action Recognition | Something-Something V2 | Parameters | 2131 | InternVideo2-6B
Action Recognition | Something-Something V2 | Top-1 Accuracy | 1 | InternVideo2-6B
Action Recognition | Something-Something V2 | Top-5 Accuracy | 12 | InternVideo2-6B
Action Recognition | ActivityNet | mAP | 95.9 | InternVideo2-6B
Video Retrieval | VATEX | text-to-video R@1 | 75.5 | InternVideo2-6B
Video Retrieval | VATEX | video-to-text R@1 | 89.3 | InternVideo2-6B
Video Retrieval | ActivityNet | text-to-video R@1 | 74.1 | InternVideo2-6B
Video Retrieval | ActivityNet | video-to-text R@1 | 69.7 | InternVideo2-6B
Video Retrieval | DiDeMo | text-to-video R@1 | 74.2 | InternVideo2-6B
Video Retrieval | DiDeMo | video-to-text R@1 | 71.9 | InternVideo2-6B
Video Retrieval | MSR-VTT | text-to-video R@1 | 62.8 | InternVideo2-6B
Video Retrieval | MSR-VTT | video-to-text R@1 | 60.2 | InternVideo2-6B
Video Retrieval | LSMDC | text-to-video R@1 | 46.4 | InternVideo2-6B
Video Retrieval | LSMDC | video-to-text R@1 | 46.7 | InternVideo2-6B
Video Retrieval | MSVD | text-to-video R@1 | 61.4 | InternVideo2-6B
Video Retrieval | MSVD | video-to-text R@1 | 85.2 | InternVideo2-6B
Video Retrieval | QVHighlights | R@1,IoU=0.5 | 71.42 | InternVideo2-6B
Video Retrieval | QVHighlights | R@1,IoU=0.7 | 56.45 | InternVideo2-6B
Video Retrieval | QVHighlights | R@1,IoU=0.5 | 70 | InternVideo2-1B
Video Retrieval | QVHighlights | R@1,IoU=0.7 | 54.45 | InternVideo2-1B
Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 70.03 | InternVideo2-6B
Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 48.95 | InternVideo2-6B
Moment Retrieval | Charades-STA | R@1 IoU=0.5 | 68.36 | InternVideo2-1B
Moment Retrieval | Charades-STA | R@1 IoU=0.7 | 45.03 | InternVideo2-1B
Moment Retrieval | QVHighlights | R@1 IoU=0.5 | 71.42 | InternVideo2-6B
Moment Retrieval | QVHighlights | R@1 IoU=0.7 | 56.45 | InternVideo2-6B
Moment Retrieval | QVHighlights | mAP | 49.24 | InternVideo2-6B
Classification | ESC-50 | Accuracy (5-fold) | 98.6 | InternVideo2
Classification | ESC-50 | Top-1 Accuracy | 98.6 | InternVideo2
Video Grounding | QVHighlights | R@1,IoU=0.5 | 71.42 | InternVideo2-6B
Video Grounding | QVHighlights | R@1,IoU=0.7 | 56.45 | InternVideo2-6B
Video Grounding | QVHighlights | R@1,IoU=0.5 | 70 | InternVideo2-1B
Video Grounding | QVHighlights | R@1,IoU=0.7 | 54.45 | InternVideo2-1B
Text to Audio Retrieval | AudioCaps | R@1 | 55.2 | InternVideo2-6B
Text to Audio Retrieval | Clotho | R@1 | 27.2 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 71.5 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | text-to-video R@10 | 97.1 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | text-to-video R@5 | 94 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 85.3 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | video-to-text R@10 | 99.3 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | video-to-text R@5 | 97.9 | InternVideo2-6B
Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 70.4 | InternVideo2-1B
Zero-Shot Video Retrieval | VATEX | text-to-video R@10 | 96.9 | InternVideo2-1B
Zero-Shot Video Retrieval | VATEX | text-to-video R@5 | 93.4 | InternVideo2-1B
Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 85.4 | InternVideo2-1B
Zero-Shot Video Retrieval | VATEX | video-to-text R@10 | 99.1 | InternVideo2-1B
Zero-Shot Video Retrieval | VATEX | video-to-text R@5 | 97.6 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 55.9 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 85.1 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 78.3 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 53.7 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 84.1 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 77.5 | InternVideo2-6B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 51.9 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 82.5 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 75.3 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 50.9 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@10 | 81.8 | InternVideo2-1B
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@5 | 73.4 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 59.3 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 89.6 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 84.4 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 83.1 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 97 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 94.2 | InternVideo2-6B
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 58.1 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | text-to-video R@10 | 88.4 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | text-to-video R@5 | 83 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 83.3 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | video-to-text R@10 | 96.9 | InternVideo2-1B
Zero-Shot Video Retrieval | MSVD | video-to-text R@5 | 94.3 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 57.9 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 84.6 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 80 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 57.1 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 85 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 79.9 | InternVideo2-6B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 57 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 85.1 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 80 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 54.3 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 83.5 | InternVideo2-1B
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 77.2 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 33.8 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 62.2 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 55.9 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@1 | 30.1 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@10 | 54.8 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@5 | 47.7 | InternVideo2-6B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 32 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 59.4 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 52.4 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@1 | 27.3 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@10 | 51.6 | InternVideo2-1B
Zero-Shot Video Retrieval | LSMDC | video-to-text R@5 | 44.2 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 63.2 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 92.5 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 85.6 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 56.5 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 90.3 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 82.8 | InternVideo2-6B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 60.4 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 90.8 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 83.9 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 54.8 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 89.5 | InternVideo2-1B
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 81.5 | InternVideo2-1B
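
For readers unfamiliar with the retrieval metrics reported above, the following is a minimal sketch of how Recall@K (R@1, R@5, R@10) is commonly computed from a query-candidate similarity matrix. It is not taken from the InternVideo2 codebase: the function name `recall_at_k`, the toy similarity values, and the one-to-one pairing (query i matches candidate i) are all assumptions made for illustration, and benchmarks with several captions per video need a different ground-truth mapping.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct candidate: how many candidates score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Hypothetical text-to-video scores: 3 captions (rows) against 3 videos (columns).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

t2v_r1 = recall_at_k(sim, 1)
t2v_r2 = recall_at_k(sim, 2)

# Video-to-text retrieval scores the same pairs with the roles swapped,
# which corresponds to transposing the matrix.
sim_t = [list(col) for col in zip(*sim)]
v2t_r1 = recall_at_k(sim_t, 1)

print(f"text-to-video R@1: {t2v_r1:.1f}")
print(f"text-to-video R@2: {t2v_r2:.1f}")
print(f"video-to-text R@1: {v2t_r1:.1f}")
```

Text-to-video and video-to-text scores generally differ because ranking along rows and ranking along columns of the same matrix are separate questions, which is why the table lists both directions.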

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)