Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present general video foundation models, InternVideo, which take advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets across a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
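
To make the "video-language contrastive learning" objective mentioned above more concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over a batch of video-text similarity scores. It is an illustration only: the function names, the symmetric form, and the temperature value of 0.07 are assumptions made for this example and are not taken from the InternVideo code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the authors' code).

    sim[i][j] is the similarity between video i and text j in a batch;
    matched video-text pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Video-to-text direction: each row is a distribution over texts.
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-video direction: each column is a distribution over videos.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]   # matched pairs score highest
    shuffled = [[0.0, 1.0], [1.0, 0.0]]  # matched pairs score lowest
    print(symmetric_contrastive_loss(aligned))
    print(symmetric_contrastive_loss(shuffled))
```

The loss is small when each video is most similar to its own caption and large when it is not; with a completely uninformative similarity matrix it equals the log of the batch size.
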
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | HACS | Average-mAP | 41.55 | InternVideo |
| Video | ActivityNet-1.3 | mAP | 39 | InternVideo |
| Video | FineAction | mAP | 17.57 | InternVideo |
| Video | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features) |
| Video | VATEX | text-to-video R@1 | 71.1 | InternVideo |
| Video | VATEX | video-to-text R@1 | 87.2 | InternVideo |
| Video | ActivityNet | text-to-video R@1 | 62.2 | InternVideo |
| Video | ActivityNet | video-to-text R@1 | 62.8 | InternVideo |
| Video | DiDeMo | text-to-video R@1 | 57.9 | InternVideo |
| Video | DiDeMo | video-to-text R@1 | 59.1 | InternVideo |
| Video | MSR-VTT | text-to-video R@1 | 55.2 | InternVideo |
| Video | MSR-VTT | video-to-text R@1 | 57.9 | InternVideo |
| Video | LSMDC | text-to-video R@1 | 34 | InternVideo |
| Video | LSMDC | video-to-text R@1 | 34.9 | InternVideo |
| Video | MSVD | text-to-video R@1 | 58.4 | InternVideo |
| Video | MSVD | video-to-text R@1 | 76.3 | InternVideo |
| Video | Kinetics-700 | Top-1 Accuracy | 84 | InternVideo-T |
| Video | Kinetics-400 | Top-1 Accuracy | 91.1 | InternVideo |
| Video | Kinetics-600 | Top-1 Accuracy | 91.3 | InternVideo-T |
| Temporal Action Localization | HACS | Average-mAP | 41.55 | InternVideo |
| Temporal Action Localization | ActivityNet-1.3 | mAP | 39 | InternVideo |
| Temporal Action Localization | FineAction | mAP | 17.57 | InternVideo |
| Temporal Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features) |
| Zero-Shot Learning | HACS | Average-mAP | 41.55 | InternVideo |
| Zero-Shot Learning | ActivityNet-1.3 | mAP | 39 | InternVideo |
| Zero-Shot Learning | FineAction | mAP | 17.57 | InternVideo |
| Zero-Shot Learning | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features) |
| Question Answering | STAR Benchmark | Accuracy | 41.6 | InternVideo |
| Question Answering | TVQA | Accuracy | 35.9 | InternVideo (no speech) |
| Question Answering | EgoSchema (fullset) | Accuracy | 32.1 | InternVideo |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 47.1 | InternVideo |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 55.5 | InternVideo |
| Visual Question Answering (VQA) | TGIF-QA | Accuracy | 72.2 | InternVideo |
| Video Question Answering | STAR Benchmark | Average Accuracy | 58.7 | InternVideo |
| Video Question Answering | STAR Benchmark | Accuracy | 41.6 | InternVideo |
| Video Question Answering | TVQA | Accuracy | 35.9 | InternVideo (no speech) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 32.1 | InternVideo |
| Activity Recognition | Something-Something V1 | Top-1 Accuracy | 70 | InternVideo |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 77.2 | InternVideo |
| Activity Recognition | AVA v2.2 | mAP | 41.01 | InternVideo |
| Activity Recognition | UCF101-MiTv2 | AUROC | 91.85 | InternVideo |
| Activity Recognition | UCF-HMDB | AUROC | 85.48 | InternVideo |
| Action Localization | HACS | Average-mAP | 41.55 | InternVideo |
| Action Localization | ActivityNet-1.3 | mAP | 39 | InternVideo |
| Action Localization | FineAction | mAP | 17.57 | InternVideo |
| Action Localization | THUMOS’14 | Avg mAP (0.3:0.7) | 71.58 | ActionFormer (InternVideo features) |
| Action Localization | AVA-Kinetics | val mAP | 41.01 | InternVideo |
| Action Recognition | Something-Something V1 | Top-1 Accuracy | 70 | InternVideo |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 77.2 | InternVideo |
| Action Recognition | AVA v2.2 | mAP | 41.01 | InternVideo |
| Action Recognition | UCF101-MiTv2 | AUROC | 91.85 | InternVideo |
| Action Recognition | UCF-HMDB | AUROC | 85.48 | InternVideo |
| Video Retrieval | VATEX | text-to-video R@1 | 71.1 | InternVideo |
| Video Retrieval | VATEX | video-to-text R@1 | 87.2 | InternVideo |
| Video Retrieval | ActivityNet | text-to-video R@1 | 62.2 | InternVideo |
| Video Retrieval | ActivityNet | video-to-text R@1 | 62.8 | InternVideo |
| Video Retrieval | DiDeMo | text-to-video R@1 | 57.9 | InternVideo |
| Video Retrieval | DiDeMo | video-to-text R@1 | 59.1 | InternVideo |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 55.2 | InternVideo |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 57.9 | InternVideo |
| Video Retrieval | LSMDC | text-to-video R@1 | 34 | InternVideo |
| Video Retrieval | LSMDC | video-to-text R@1 | 34.9 | InternVideo |
| Video Retrieval | MSVD | text-to-video R@1 | 58.4 | InternVideo |
| Video Retrieval | MSVD | video-to-text R@1 | 76.3 | InternVideo |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 49.5 | InternVideo |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 69.5 | InternVideo |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 40.7 | InternVideo |
| Zero-Shot Video Retrieval | MSR-VTT | video-to-text R@1 | 39.6 | InternVideo |
| Zero-Shot Video Retrieval | MSVD | text-to-video R@1 | 43.4 | InternVideo |
| Zero-Shot Video Retrieval | MSVD | video-to-text R@1 | 67.6 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 31.5 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 68.2 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 57.6 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@1 | 33.5 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@10 | 71.1 | InternVideo |
| Zero-Shot Video Retrieval | DiDeMo | video-to-text R@5 | 60.3 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@1 | 17.6 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@10 | 40.2 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | text-to-video R@5 | 32.4 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | video-to-text R@1 | 13.2 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | video-to-text R@10 | 34.9 | InternVideo |
| Zero-Shot Video Retrieval | LSMDC | video-to-text R@5 | 27.8 | InternVideo |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 30.7 | InternVideo |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 31.4 | InternVideo |
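
The retrieval rows above report R@1, R@5, and R@10 in both directions. As a reading aid, the following sketch shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes exactly one ground-truth candidate per query, located on the diagonal; benchmark protocols with several captions per video differ, and the helper names here are hypothetical rather than part of any InternVideo release.

```python
def recall_at_k(sim, k):
    """Recall@K in percent (illustrative helper).

    sim[i][j] is the score of query i against candidate j.
    Assumption: the ground truth for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Number of candidates that score strictly higher than the ground truth.
        better = sum(1 for s in row if s > row[i])
        if better < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap query and candidate roles (text-to-video <-> video-to-text)."""
    return [list(col) for col in zip(*sim)]


if __name__ == "__main__":
    # Rows: text queries, columns: videos (toy numbers, not real results).
    text_to_video = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.4],
        [0.1, 0.2, 0.7],
    ]
    for k in (1, 3):
        print(f"t2v R@{k}:", recall_at_k(text_to_video, k))
    print("v2t R@1:", recall_at_k(transpose(text_to_video), 1))
```

In the toy matrix, the second text query ranks its own video third, so it counts as a miss at K=1 but a hit at K=3. Transposing the matrix evaluates the opposite direction, which is why text-to-video and video-to-text scores on the same dataset can differ.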