InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

2022-12-06

Tasks: Zero-Shot Video Question Answer, Video Retrieval, Action Classification, Zero-Shot Video Retrieval, Spatio-Temporal Action Localization, Video Question Answering, Contrastive Learning, Video Understanding, Action Recognition, Visual Question Answering (VQA), Temporal Action Localization, Open Set Action Recognition


Abstract

Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
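The abstract names video-language contrastive learning as one of the two pretraining objectives but does not spell out the loss. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss over a batch of paired video and text embeddings. The function names, the temperature of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the paper or its code release.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length (assumes the vector is not all zeros).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    # Hypothetical helper: pair i of video_embs and text_embs is a positive,
    # every other pairing in the batch is treated as a negative.
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    # Cosine similarity scaled by the temperature (0.07 is an assumed value).
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = cross_entropy_diagonal(logits)
    text_to_video = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)

# Toy batch: matched pairs should give a much lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss near zero, while the swapped pairing is penalized heavily, which is the behavior a contrastive objective is meant to produce.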

Results

Task	Dataset	Metric	Value	Model
Video	HACS	Average-mAP	41.55	InternVideo
Video	ActivityNet-1.3	mAP	39	InternVideo
Video	FineAction	mAP	17.57	InternVideo
Video	THUMOS’14	Avg mAP (0.3:0.7)	71.58	ActionFormer (InternVideo features)
Video	VATEX	text-to-video R@1	71.1	InternVideo
Video	VATEX	video-to-text R@1	87.2	InternVideo
Video	ActivityNet	text-to-video R@1	62.2	InternVideo
Video	ActivityNet	video-to-text R@1	62.8	InternVideo
Video	DiDeMo	text-to-video R@1	57.9	InternVideo
Video	DiDeMo	video-to-text R@1	59.1	InternVideo
Video	MSR-VTT	text-to-video R@1	55.2	InternVideo
Video	MSR-VTT	video-to-text R@1	57.9	InternVideo
Video	LSMDC	text-to-video R@1	34	InternVideo
Video	LSMDC	video-to-text R@1	34.9	InternVideo
Video	MSVD	text-to-video R@1	58.4	InternVideo
Video	MSVD	video-to-text R@1	76.3	InternVideo
Video	Kinetics-700	Top-1 Accuracy	84	InternVideo-T
Video	Kinetics-400	Acc@1	91.1	InternVideo
Video	Kinetics-600	Top-1 Accuracy	91.3	InternVideo-T
Temporal Action Localization	HACS	Average-mAP	41.55	InternVideo
Temporal Action Localization	ActivityNet-1.3	mAP	39	InternVideo
Temporal Action Localization	FineAction	mAP	17.57	InternVideo
Temporal Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	71.58	ActionFormer (InternVideo features)
Zero-Shot Learning	HACS	Average-mAP	41.55	InternVideo
Zero-Shot Learning	ActivityNet-1.3	mAP	39	InternVideo
Zero-Shot Learning	FineAction	mAP	17.57	InternVideo
Zero-Shot Learning	THUMOS’14	Avg mAP (0.3:0.7)	71.58	ActionFormer (InternVideo features)
Question Answering	STAR Benchmark	Accuracy	41.6	InternVideo
Question Answering	TVQA	Accuracy	35.9	InternVideo (no speech)
Question Answering	EgoSchema (fullset)	Accuracy	32.1	InternVideo
Visual Question Answering (VQA)	MSRVTT-QA	Accuracy	0.471	InternVideo
Visual Question Answering (VQA)	MSVD-QA	Accuracy	0.555	InternVideo
Visual Question Answering (VQA)	TGIF-QA	Accuracy	0.722	InternVideo
Video Question Answering	STAR Benchmark	Average Accuracy	58.7	InternVideo
Video Question Answering	STAR Benchmark	Accuracy	41.6	InternVideo
Video Question Answering	TVQA	Accuracy	35.9	InternVideo (no speech)
Video Question Answering	EgoSchema (fullset)	Accuracy	32.1	InternVideo
Activity Recognition	Something-Something V1	Top 1 Accuracy	70	InternVideo
Activity Recognition	Something-Something V2	Top-1 Accuracy	77.2	InternVideo
Activity Recognition	AVA v2.2	mAP	41.01	InternVideo
Activity Recognition	UCF101-MiTv2	AUROC	91.85	InternVideo
Activity Recognition	UCF-HMDB	AUROC	85.48	InternVideo
Action Localization	HACS	Average-mAP	41.55	InternVideo
Action Localization	ActivityNet-1.3	mAP	39	InternVideo
Action Localization	FineAction	mAP	17.57	InternVideo
Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	71.58	ActionFormer (InternVideo features)
Action Localization	AVA-Kinetics	val mAP	41.01	InternVideo
Action Recognition	Something-Something V1	Top 1 Accuracy	70	InternVideo
Action Recognition	Something-Something V2	Top-1 Accuracy	77.2	InternVideo
Action Recognition	AVA v2.2	mAP	41.01	InternVideo
Action Recognition	UCF101-MiTv2	AUROC	91.85	InternVideo
Action Recognition	UCF-HMDB	AUROC	85.48	InternVideo
Video Retrieval	VATEX	text-to-video R@1	71.1	InternVideo
Video Retrieval	VATEX	video-to-text R@1	87.2	InternVideo
Video Retrieval	ActivityNet	text-to-video R@1	62.2	InternVideo
Video Retrieval	ActivityNet	video-to-text R@1	62.8	InternVideo
Video Retrieval	DiDeMo	text-to-video R@1	57.9	InternVideo
Video Retrieval	DiDeMo	video-to-text R@1	59.1	InternVideo
Video Retrieval	MSR-VTT	text-to-video R@1	55.2	InternVideo
Video Retrieval	MSR-VTT	video-to-text R@1	57.9	InternVideo
Video Retrieval	LSMDC	text-to-video R@1	34	InternVideo
Video Retrieval	LSMDC	video-to-text R@1	34.9	InternVideo
Video Retrieval	MSVD	text-to-video R@1	58.4	InternVideo
Video Retrieval	MSVD	video-to-text R@1	76.3	InternVideo
Zero-Shot Video Retrieval	VATEX	text-to-video R@1	49.5	InternVideo
Zero-Shot Video Retrieval	VATEX	video-to-text R@1	69.5	InternVideo
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	40.7	InternVideo
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@1	39.6	InternVideo
Zero-Shot Video Retrieval	MSVD	text-to-video R@1	43.4	InternVideo
Zero-Shot Video Retrieval	MSVD	video-to-text R@1	67.6	InternVideo
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	31.5	InternVideo
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	68.2	InternVideo
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	57.6	InternVideo
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@1	33.5	InternVideo
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@10	71.1	InternVideo
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@5	60.3	InternVideo
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	17.6	InternVideo
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	40.2	InternVideo
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	32.4	InternVideo
Zero-Shot Video Retrieval	LSMDC	video-to-text R@1	13.2	InternVideo
Zero-Shot Video Retrieval	LSMDC	video-to-text R@10	34.9	InternVideo
Zero-Shot Video Retrieval	LSMDC	video-to-text R@5	27.8	InternVideo
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@1	30.7	InternVideo
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@1	31.4	InternVideo
