InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang

2024-03-22

Zero-Shot Video Question Answer, Text to Audio Retrieval, Video Retrieval, Action Classification, Audio Classification, Video Grounding, Zero-Shot Video Retrieval, Video Recognition, Video Question Answering, Contrastive Learning, Moment Retrieval, Video Understanding, Action Recognition, Temporal Action Localization, Video Instance Segmentation, Zero-shot Text to Audio Retrieval

Abstract

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.
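
The abstract names cross-modal contrastive learning as one training stage but does not give the objective. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE loss over paired video and text embeddings, which is a common choice for this kind of alignment. It is not taken from the InternVideo2 code; the function name `symmetric_info_nce`, the helper names, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's code).

    video_embs[i] and text_embs[i] are assumed to be a matched pair;
    every other item in the batch acts as a negative.
    """
    v = [_normalize(x) for x in video_embs]
    t = [_normalize(x) for x in text_embs]
    n = len(v)
    # Cosine similarity between every video and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(v[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same idea on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(_log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired embeddings the loss is close to zero, while swapping the text embeddings makes it large, which is the behavior a contrastive alignment objective is meant to reward.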

Results

Task	Dataset	Metric	Value	Model
Video	HACS	Average-mAP	43.3	InternVideo2-6B
Video	HACS	Average-mAP	42.4	InternVideo2-1B
Video	ActivityNet-1.3	mAP	41.2	InternVideo2-6B
Video	ActivityNet-1.3	mAP	40.4	InternVideo2-1B
Video	FineAction	mAP	27.7	InternVideo2-6B
Video	THUMOS’14	Avg mAP (0.3:0.7)	72	InternVideo2-6B
Video	THUMOS’14	Avg mAP (0.3:0.7)	69.8	InternVideo2-1B
Video	VATEX	text-to-video R@1	75.5	InternVideo2-6B
Video	VATEX	video-to-text R@1	89.3	InternVideo2-6B
Video	ActivityNet	text-to-video R@1	74.1	InternVideo2-6B
Video	ActivityNet	video-to-text R@1	69.7	InternVideo2-6B
Video	DiDeMo	text-to-video R@1	74.2	InternVideo2-6B
Video	DiDeMo	video-to-text R@1	71.9	InternVideo2-6B
Video	MSR-VTT	text-to-video R@1	62.8	InternVideo2-6B
Video	MSR-VTT	video-to-text R@1	60.2	InternVideo2-6B
Video	LSMDC	text-to-video R@1	46.4	InternVideo2-6B
Video	LSMDC	video-to-text R@1	46.7	InternVideo2-6B
Video	MSVD	text-to-video R@1	61.4	InternVideo2-6B
Video	MSVD	video-to-text R@1	85.2	InternVideo2-6B
Video	QVHighlights	R@1,IoU=0.5	71.42	InternVideo2-6B
Video	QVHighlights	R@1,IoU=0.7	56.45	InternVideo2-6B
Video	QVHighlights	R@1,IoU=0.5	70	InternVideo2-1B
Video	QVHighlights	R@1,IoU=0.7	54.45	InternVideo2-1B
Video	Kinetics-700	Top-1 Accuracy	85.9	InternVideo2-6B
Video	Kinetics-700	Top-1 Accuracy	85.4	InternVideo2-1B
Video	MiT	Top 1 Accuracy	50.9	InternVideo2-1B
Video	Kinetics-400	Acc@1	92.1	InternVideo2-6B
Video	Kinetics-400	Acc@1	91.6	InternVideo2-1B
Video	Kinetics-600	Top-1 Accuracy	91.9	InternVideo2-6B
Video	Kinetics-600	Top-1 Accuracy	91.6	InternVideo2-1B
Video	MiT	Top 1 Accuracy	51.2	InternVideo2-6B
Temporal Action Localization	HACS	Average-mAP	43.3	InternVideo2-6B
Temporal Action Localization	HACS	Average-mAP	42.4	InternVideo2-1B
Temporal Action Localization	ActivityNet-1.3	mAP	41.2	InternVideo2-6B
Temporal Action Localization	ActivityNet-1.3	mAP	40.4	InternVideo2-1B
Temporal Action Localization	FineAction	mAP	27.7	InternVideo2-6B
Temporal Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	72	InternVideo2-6B
Temporal Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	69.8	InternVideo2-1B
Zero-Shot Learning	HACS	Average-mAP	43.3	InternVideo2-6B
Zero-Shot Learning	HACS	Average-mAP	42.4	InternVideo2-1B
Zero-Shot Learning	ActivityNet-1.3	mAP	41.2	InternVideo2-6B
Zero-Shot Learning	ActivityNet-1.3	mAP	40.4	InternVideo2-1B
Zero-Shot Learning	FineAction	mAP	27.7	InternVideo2-6B
Zero-Shot Learning	THUMOS’14	Avg mAP (0.3:0.7)	72	InternVideo2-6B
Zero-Shot Learning	THUMOS’14	Avg mAP (0.3:0.7)	69.8	InternVideo2-1B
Question Answering	MVBench	Accuracy	60.9	InternVideo2-1B
Question Answering	EgoSchema (fullset)	Accuracy	60.2	InternVideo2-6B
Video Question Answering	Perception Test	Accuracy (Top-1)	63.4	InternVideo2 (8B)
Video Question Answering	MVBench	Avg.	67.2	InternVideo2
Video Question Answering	MVBench	Accuracy	60.9	InternVideo2-1B
Video Question Answering	EgoSchema (fullset)	Accuracy	60.2	InternVideo2-6B
Activity Recognition	HACS	Top 1 Accuracy	97	InternVideo2-6B
Activity Recognition	Something-Something V2	Top-1 Accuracy	77.1	InternVideo2-1B
Activity Recognition	Something-Something V2	GFLOPs	13321	InternVideo2-6B
Activity Recognition	Something-Something V2	Parameters	2131	InternVideo2-6B
Activity Recognition	Something-Something V2	Top-1 Accuracy	1	InternVideo2-6B
Activity Recognition	Something-Something V2	Top-5 Accuracy	12	InternVideo2-6B
Activity Recognition	ActivityNet	mAP	95.9	InternVideo2-6B
Action Localization	HACS	Average-mAP	43.3	InternVideo2-6B
Action Localization	HACS	Average-mAP	42.4	InternVideo2-1B
Action Localization	ActivityNet-1.3	mAP	41.2	InternVideo2-6B
Action Localization	ActivityNet-1.3	mAP	40.4	InternVideo2-1B
Action Localization	FineAction	mAP	27.7	InternVideo2-6B
Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	72	InternVideo2-6B
Action Localization	THUMOS’14	Avg mAP (0.3:0.7)	69.8	InternVideo2-1B
Audio Classification	ESC-50	Accuracy (5-fold)	98.6	InternVideo2
Audio Classification	ESC-50	Top-1 Accuracy	98.6	InternVideo2
Action Recognition	HACS	Top 1 Accuracy	97	InternVideo2-6B
Action Recognition	Something-Something V2	Top-1 Accuracy	77.1	InternVideo2-1B
Action Recognition	Something-Something V2	GFLOPs	13321	InternVideo2-6B
Action Recognition	Something-Something V2	Parameters	2131	InternVideo2-6B
Action Recognition	Something-Something V2	Top-1 Accuracy	1	InternVideo2-6B
Action Recognition	Something-Something V2	Top-5 Accuracy	12	InternVideo2-6B
Action Recognition	ActivityNet	mAP	95.9	InternVideo2-6B
Video Retrieval	VATEX	text-to-video R@1	75.5	InternVideo2-6B
Video Retrieval	VATEX	video-to-text R@1	89.3	InternVideo2-6B
Video Retrieval	ActivityNet	text-to-video R@1	74.1	InternVideo2-6B
Video Retrieval	ActivityNet	video-to-text R@1	69.7	InternVideo2-6B
Video Retrieval	DiDeMo	text-to-video R@1	74.2	InternVideo2-6B
Video Retrieval	DiDeMo	video-to-text R@1	71.9	InternVideo2-6B
Video Retrieval	MSR-VTT	text-to-video R@1	62.8	InternVideo2-6B
Video Retrieval	MSR-VTT	video-to-text R@1	60.2	InternVideo2-6B
Video Retrieval	LSMDC	text-to-video R@1	46.4	InternVideo2-6B
Video Retrieval	LSMDC	video-to-text R@1	46.7	InternVideo2-6B
Video Retrieval	MSVD	text-to-video R@1	61.4	InternVideo2-6B
Video Retrieval	MSVD	video-to-text R@1	85.2	InternVideo2-6B
Video Retrieval	QVHighlights	R@1,IoU=0.5	71.42	InternVideo2-6B
Video Retrieval	QVHighlights	R@1,IoU=0.7	56.45	InternVideo2-6B
Video Retrieval	QVHighlights	R@1,IoU=0.5	70	InternVideo2-1B
Video Retrieval	QVHighlights	R@1,IoU=0.7	54.45	InternVideo2-1B
Moment Retrieval	Charades-STA	R@1 IoU=0.5	70.03	InternVideo2-6B
Moment Retrieval	Charades-STA	R@1 IoU=0.7	48.95	InternVideo2-6B
Moment Retrieval	Charades-STA	R@1 IoU=0.5	68.36	InternVideo2-1B
Moment Retrieval	Charades-STA	R@1 IoU=0.7	45.03	InternVideo2-1B
Moment Retrieval	QVHighlights	R@1 IoU=0.5	71.42	InternVideo2-6B
Moment Retrieval	QVHighlights	R@1 IoU=0.7	56.45	InternVideo2-6B
Moment Retrieval	QVHighlights	mAP	49.24	InternVideo2-6B
Classification	ESC-50	Accuracy (5-fold)	98.6	InternVideo2
Classification	ESC-50	Top-1 Accuracy	98.6	InternVideo2
Video Grounding	QVHighlights	R@1,IoU=0.5	71.42	InternVideo2-6B
Video Grounding	QVHighlights	R@1,IoU=0.7	56.45	InternVideo2-6B
Video Grounding	QVHighlights	R@1,IoU=0.5	70	InternVideo2-1B
Video Grounding	QVHighlights	R@1,IoU=0.7	54.45	InternVideo2-1B
Text to Audio Retrieval	AudioCaps	R@1	55.2	InternVideo2-6B
Text to Audio Retrieval	Clotho	R@1	27.2	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	text-to-video R@1	71.5	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	text-to-video R@10	97.1	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	text-to-video R@5	94	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	video-to-text R@1	85.3	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	video-to-text R@10	99.3	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	video-to-text R@5	97.9	InternVideo2-6B
Zero-Shot Video Retrieval	VATEX	text-to-video R@1	70.4	InternVideo2-1B
Zero-Shot Video Retrieval	VATEX	text-to-video R@10	96.9	InternVideo2-1B
Zero-Shot Video Retrieval	VATEX	text-to-video R@5	93.4	InternVideo2-1B
Zero-Shot Video Retrieval	VATEX	video-to-text R@1	85.4	InternVideo2-1B
Zero-Shot Video Retrieval	VATEX	video-to-text R@10	99.1	InternVideo2-1B
Zero-Shot Video Retrieval	VATEX	video-to-text R@5	97.6	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	55.9	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	85.1	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	78.3	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@1	53.7	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@10	84.1	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@5	77.5	InternVideo2-6B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@1	51.9	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@10	82.5	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	text-to-video R@5	75.3	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@1	50.9	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@10	81.8	InternVideo2-1B
Zero-Shot Video Retrieval	MSR-VTT	video-to-text R@5	73.4	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	text-to-video R@1	59.3	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	text-to-video R@10	89.6	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	text-to-video R@5	84.4	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	video-to-text R@1	83.1	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	video-to-text R@10	97	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	video-to-text R@5	94.2	InternVideo2-6B
Zero-Shot Video Retrieval	MSVD	text-to-video R@1	58.1	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	text-to-video R@10	88.4	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	text-to-video R@5	83	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	video-to-text R@1	83.3	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	video-to-text R@10	96.9	InternVideo2-1B
Zero-Shot Video Retrieval	MSVD	video-to-text R@5	94.3	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	57.9	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	84.6	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	80	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@1	57.1	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@10	85	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@5	79.9	InternVideo2-6B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@1	57	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@10	85.1	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	text-to-video R@5	80	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@1	54.3	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@10	83.5	InternVideo2-1B
Zero-Shot Video Retrieval	DiDeMo	video-to-text R@5	77.2	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	33.8	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	62.2	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	55.9	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@1	30.1	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@10	54.8	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@5	47.7	InternVideo2-6B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@1	32	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@10	59.4	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	text-to-video R@5	52.4	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@1	27.3	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@10	51.6	InternVideo2-1B
Zero-Shot Video Retrieval	LSMDC	video-to-text R@5	44.2	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@1	63.2	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@10	92.5	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@5	85.6	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@1	56.5	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@10	90.3	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@5	82.8	InternVideo2-6B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@1	60.4	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@10	90.8	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	text-to-video R@5	83.9	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@1	54.8	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@10	89.5	InternVideo2-1B
Zero-Shot Video Retrieval	ActivityNet	video-to-text R@5	81.5	InternVideo2-1B
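
For readers unfamiliar with the metrics in the table, the sketch below shows how Recall@K (the R@1, R@5, and R@10 retrieval columns) and R@1 at a temporal IoU threshold (the moment-retrieval and grounding rows) are typically computed. It is a simplified illustration, not the evaluation code behind the numbers above: it assumes one ground-truth candidate per query (query `i` matches candidate `i`) and one predicted segment per query, and the function names are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be j == i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1_iou(preds, gts, threshold):
    """Percentage of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Toy similarity matrix: queries 0 and 2 rank their match first, query 1 second.
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
r1_iou = recall_at_1_iou([(0, 10), (2, 8)], [(5, 15), (2, 10)], 0.5)
```

Benchmarks with several captions per video or several ground-truth moments per query need a more careful matching rule than the one-to-one assumption used here.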
