Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

2022-12-06
Zero-Shot Video Question Answer, Video Retrieval, Action Classification, Zero-Shot Video Retrieval, Spatio-Temporal Action Localization, Video Question Answering, Contrastive Learning, Video Understanding, Action Recognition, Visual Question Answering (VQA), Temporal Action Localization, Open Set Action Recognition

Abstract

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
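
The video-language contrastive objective named in the abstract is commonly written as a symmetric InfoNCE loss over a batch of paired video and text embeddings. The sketch below illustrates that general formulation only; it is not the authors' implementation. The function name `symmetric_contrastive_loss`, the temperature of 0.07, and the toy embeddings are assumptions made for illustration.

```python
import math

def _normalize(vec):
    # L2-normalize a vector (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Hypothetical helper: video_emb[i] and text_emb[i] form a matched pair.
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Toy example: matched pairs give a much lower loss than mismatched pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(videos, aligned_text)
loss_shuffled = symmetric_contrastive_loss(videos, shuffled_text)
```

Averaging the video-to-text and text-to-video terms keeps the objective symmetric, which matches the two retrieval directions reported in the results.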

Results

Task | Dataset | Metric | Value | Model
Video | HACS | Average-mAP | 41.55 | InternVideo
Video | ActivityNet-1.3 | mAP | 39 | InternVideo
Video | FineAction | mAP | 17.57 | InternVideo
Video | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features)
Video | VATEX | text-to-video R@1 | 71.1 | InternVideo
Video | VATEX | video-to-text R@1 | 87.2 | InternVideo
Video | ActivityNet | text-to-video R@1 | 62.2 | InternVideo
Video | ActivityNet | video-to-text R@1 | 62.8 | InternVideo
Video | DiDeMo | text-to-video R@1 | 57.9 | InternVideo
Video | DiDeMo | video-to-text R@1 | 59.1 | InternVideo
Video | MSR-VTT | text-to-video R@1 | 55.2 | InternVideo
Video | MSR-VTT | video-to-text R@1 | 57.9 | InternVideo
Video | LSMDC | text-to-video R@1 | 34 | InternVideo
Video | LSMDC | video-to-text R@1 | 34.9 | InternVideo
Video | MSVD | text-to-video R@1 | 58.4 | InternVideo
Video | MSVD | video-to-text R@1 | 76.3 | InternVideo
Video | Kinetics-700 | Top-1 Accuracy | 84 | InternVideo-T
Video | Kinetics-400 | Acc@1 | 91.1 | InternVideo
Video | Kinetics-600 | Top-1 Accuracy | 91.3 | InternVideo-T
Temporal Action Localization | HACS | Average-mAP | 41.55 | InternVideo
Temporal Action Localization | ActivityNet-1.3 | mAP | 39 | InternVideo
Temporal Action Localization | FineAction | mAP | 17.57 | InternVideo
Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features)
Zero-Shot Learning | HACS | Average-mAP | 41.55 | InternVideo
Zero-Shot Learning | ActivityNet-1.3 | mAP | 39 | InternVideo
Zero-Shot Learning | FineAction | mAP | 17.57 | InternVideo
Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features)
Question Answering | STAR Benchmark | Accuracy | 41.6 | InternVideo
Question Answering | TVQA | Accuracy | 35.9 | InternVideo (no speech)
Question Answering | EgoSchema (fullset) | Accuracy | 32.1 | InternVideo
Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.471 | InternVideo
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.555 | InternVideo
Visual Question Answering (VQA) | TGIF-QA | Accuracy | 0.722 | InternVideo
Video Question Answering | STAR Benchmark | Average Accuracy | 58.7 | InternVideo
Video Question Answering | STAR Benchmark | Accuracy | 41.6 | InternVideo
Video Question Answering | TVQA | Accuracy | 35.9 | InternVideo (no speech)
Video Question Answering | EgoSchema (fullset) | Accuracy | 32.1 | InternVideo
Activity Recognition | Something-Something V1 | Top 1 Accuracy | 70 | InternVideo
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 77.2 | InternVideo
Activity Recognition | AVA v2.2 | mAP | 41.01 | InternVideo
Activity Recognition | UCF101-MiTv2 | AUROC | 91.85 | InternVideo
Activity Recognition | UCF-HMDB | AUROC | 85.48 | InternVideo
Action Localization | HACS | Average-mAP | 41.55 | InternVideo
Action Localization | ActivityNet-1.3 | mAP | 39 | InternVideo
Action Localization | FineAction | mAP | 17.57 | InternVideo
Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features)
Action Localization | AVA-Kinetics | val mAP | 41.01 | InternVideo
Action Recognition | Something-Something V1 | Top 1 Accuracy | 70 | InternVideo
Action Recognition | Something-Something V2 | Top-1 Accuracy | 77.2 | InternVideo
Action Recognition | AVA v2.2 | mAP | 41.01 | InternVideo
Action Recognition | UCF101-MiTv2 | AUROC | 91.85 | InternVideo
Action Recognition | UCF-HMDB | AUROC | 85.48 | InternVideo
Video Retrieval | VATEX | text-to-video R@1 | 71.1 | InternVideo
Video Retrieval | VATEX | video-to-text R@1 | 87.2 | InternVideo
Video Retrieval | ActivityNet | text-to-video R@1 | 62.2 | InternVideo
Video Retrieval | ActivityNet | video-to-text R@1 | 62.8 | InternVideo
Video Retrieval | DiDeMo | text-to-video R@1 | 57.9 | InternVideo
Video Retrieval | DiDeMo | video-to-text R@1 | 59.1 | InternVideo
Video Retrieval | MSR-VTT | text-to-video R@1 | 55.2 | InternVideo
Video Retrieval | MSR-VTT | video-to-text R@1 | 57.9 | InternVideo
Video Retrieval | LSMDC | text-to-video R@1 | 34 | InternVideo
Video Retrieval | LSMDC | video-to-text R@1 | 34.9 | InternVideo
Video Retrieval | MSVD | text-to-video R@1 | 58.4 | InternVideo
Video Retrieval | MSVD | video-to-text R@1 | 76.3 | InternVideo
Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 49.5 | InternVideo
Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 69.5 | InternVideo
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 40.7 | InternVideo
Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 39.6 | InternVideo
Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 43.4 | InternVideo
Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 67.6 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 31.5 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 68.2 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 57.6 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 33.5 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 71.1 | InternVideo
Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 60.3 | InternVideo
Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 17.6 | InternVideo
Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 40.2 | InternVideo
Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 32.4 | InternVideo
Zero-Shot Video Retrieval | LSMDC | video-to-text R@1 | 13.2 | InternVideo
Zero-Shot Video Retrieval | LSMDC | video-to-text R@10 | 34.9 | InternVideo
Zero-Shot Video Retrieval | LSMDC | video-to-text R@5 | 27.8 | InternVideo
Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.7 | InternVideo
Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 31.4 | InternVideo
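
The retrieval rows above report Recall@K (R@K): the percentage of queries whose ground-truth match appears among the top K ranked candidates. The sketch below shows one way to compute it, assuming a square similarity matrix in which the correct candidate for query i is candidate i. That pairing is a common evaluation convention, not something this page confirms, and the name `recall_at_k` and the toy scores are illustrative.

```python
def recall_at_k(sim, k):
    # sim[i][j] is the similarity between query i and candidate j.
    # Assumption: the ground-truth candidate for query i is candidate i.
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of candidates scoring strictly higher
        # than the ground-truth candidate.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)

# Toy text-to-video similarity matrix: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.2, 0.5],  # correct video ranked 3rd
    [0.2, 0.7, 0.6],  # correct video ranked 2nd
]
t2v_r1 = recall_at_k(sim, 1)
t2v_r2 = recall_at_k(sim, 2)

# Video-to-text retrieval uses the transposed matrix.
sim_v2t = [list(col) for col in zip(*sim)]
v2t_r1 = recall_at_k(sim_v2t, 1)
```

Because the two directions rank over different axes of the same matrix, text-to-video and video-to-text R@1 generally differ, as they do in the table.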

Related Papers

SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)