Shen Yan, Tao Zhu, ZiRui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
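
The idea of applying attentional pooling to flattened frame embeddings can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes single-head attention, random weights, no positional embeddings, and one set of key/value projections shared by both poolers, whereas the actual CoCa poolers are learned multi-head layers. All names (`attentional_pool`, `pool_video`, `contrastive_query`, `generative_queries`) and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentional_pool(tokens, queries, w_k, w_v):
    """Single-head cross-attention pooling (a simplification).

    tokens:  (N, D) token embeddings
    queries: (Q, D) query vectors (learned in a real model)
    w_k/w_v: (D, D) key and value projections
    Returns (Q, D): one pooled vector per query.
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

def pool_video(frame_tokens, queries, w_k, w_v):
    """Flatten (T, N, D) per-frame tokens to (T*N, D), then pool as for an image."""
    t, n, d = frame_tokens.shape
    flat = frame_tokens.reshape(t * n, d)
    return attentional_pool(flat, queries, w_k, w_v)

# Illustrative sizes only: 8 frames, 16 tokens per frame, 32-dim embeddings.
rng = np.random.default_rng(0)
T, N, D = 8, 16, 32
frame_tokens = rng.normal(size=(T, N, D))   # stand-in for image-encoder outputs
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, D)) / np.sqrt(D)
contrastive_query = rng.normal(size=(1, D))   # one query -> one video embedding
generative_queries = rng.normal(size=(4, D))  # several queries -> decoder inputs

video_embedding = pool_video(frame_tokens, contrastive_query, w_k, w_v)
caption_memory = pool_video(frame_tokens, generative_queries, w_k, w_v)
print(video_embedding.shape, caption_memory.shape)
```

Because the pooler only sees a set of tokens, the same weights accept one frame or many; in this sketch a single-frame "video" gives exactly the image-level result, which is the property that lets a pretrained pooler be reused without a separate cross-frame fusion module.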
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 46.3 | VideoCoCa |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 56.9 | VideoCoCa |
| Video Question Answering | ActivityNet-QA | Accuracy | 56.1 | VideoCoCa |
| Video Question Answering | iVQA | Accuracy | 39 | VideoCoCa |
| Video Captioning | MSR-VTT | BLEU-4 | 53.8 | VideoCoCa |
| Video Captioning | MSR-VTT | CIDEr | 73.2 | VideoCoCa |
| Video Captioning | MSR-VTT | ROUGE-L | 68 | VideoCoCa |
| Video Captioning | VATEX | BLEU-4 | 39.7 | VideoCoCa |
| Video Captioning | VATEX | CIDEr | 77.8 | VideoCoCa |
| Video Captioning | VATEX | ROUGE-L | 54.5 | VideoCoCa |
| Video Captioning | YouCook2 | BLEU-4 | 14.2 | VideoCoCa |
| Video Captioning | YouCook2 | CIDEr | 128 | VideoCoCa |
| Video Captioning | YouCook2 | ROUGE-L | 37.7 | VideoCoCa |
| Video Captioning | ActivityNet Captions | BLEU-4 | 14.7 | VideoCoCa |
| Video Captioning | ActivityNet Captions | CIDEr | 39.3 | VideoCoCa |
| Video Captioning | ActivityNet Captions | ROUGE-L | 35 | VideoCoCa |
| Video Retrieval | YouCook2 | text-to-video R@1 | 21.7 | VideoCoCa (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@5 | 43.9 | VideoCoCa (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 55.2 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 34.3 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 57.8 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 67 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 64.7 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 85.2 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 91.4 | VideoCoCa (zero-shot) |
| Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 86.6 | VideoCoCa |
| Zero-Shot Action Recognition | UCF101 | Top-5 Accuracy | 98.4 | VideoCoCa |
| Zero-Shot Action Recognition | Kinetics | Top-1 Accuracy | 70.1 | VideoCoCa |
| Zero-Shot Action Recognition | Kinetics | Top-5 Accuracy | 88.9 | VideoCoCa |
| Zero-Shot Action Recognition | Charades | mAP | 25.8 | VideoCoCa |
| Zero-Shot Action Recognition | HMDB51 | Top-1 Accuracy | 58.7 | VideoCoCa |
| Zero-Shot Action Recognition | HMDB51 | Top-5 Accuracy | 84.5 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 53.2 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@5 | 83.3 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@10 | 90.1 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 73.6 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@5 | 93.2 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@10 | 97.2 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@1 | 34.3 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@5 | 57.8 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@10 | 67 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@1 | 64.7 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@5 | 85.2 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@10 | 91.4 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 34.5 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 63.2 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 76.6 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 33 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 61.6 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 75.3 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 20.3 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 43 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 53.3 | VideoCoCa |
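
The R@K entries in the table are recall-at-K scores, reported as percentages. As a reading aid, the sketch below shows one common way to compute such a score from a query-candidate similarity matrix. It assumes exactly one ground-truth candidate per query, placed on the diagonal, which may differ from the evaluation protocol used for any particular dataset above; the function name `recall_at_k` and the toy similarity values are made up for illustration.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Percentage of queries whose true match ranks within the top k.

    similarity[i, j] is the score of query i against candidate j;
    the true match for query i is assumed to be candidate i.
    """
    true_scores = np.diag(similarity)[:, None]
    # Rank of the true match = number of candidates scoring strictly higher.
    ranks = (similarity > true_scores).sum(axis=1)
    return 100.0 * np.mean(ranks < k)

# Toy text-to-video similarity matrix: rows are captions, columns are videos.
sim = np.array([
    [0.9, 0.1, 0.2, 0.3],
    [0.8, 0.5, 0.1, 0.2],
    [0.7, 0.6, 0.4, 0.5],
    [0.1, 0.2, 0.3, 0.9],
])

text_to_video_r1 = recall_at_k(sim, 1)
text_to_video_r2 = recall_at_k(sim, 2)
video_to_text_r1 = recall_at_k(sim.T, 1)  # transpose swaps query and candidate roles
print(text_to_video_r1, text_to_video_r2, video_to_text_r1)
```

Transposing the matrix turns text-to-video retrieval into video-to-text retrieval, which is why the two directions can yield different scores on the same data.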