Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

Shen Yan, Tao Zhu, ZiRui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu

2022-12-09
Question Answering, Video Retrieval, Zero-Shot Video Retrieval, Text to Video Retrieval, Video Question Answering, Video Captioning, Zero-Shot Action Recognition, Video Classification, Retrieval, Visual Question Answering (VQA), Video to Text Retrieval

Abstract

We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
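The central idea in the abstract, applying attentional pooling to flattened frame embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, and the sizes `T`, `N`, `D`, the random "embeddings", and the function names `flatten_frames` and `attention_pool` are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_frames(frame_tokens):
    """Concatenate per-frame token lists: [T][N][D] -> [T*N][D]."""
    return [tok for frame in frame_tokens for tok in frame]


def attention_pool(queries, tokens):
    """Single-head cross-attention pooling (simplified, no projections).

    Each query attends over every token and returns a weighted average,
    so the output length depends on the number of queries, not on the
    number of tokens (and therefore not on the number of frames).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


# Toy sizes (hypothetical): T frames, N tokens per frame, D channels.
rng = random.Random(0)
T, N, D = 4, 3, 8
frame_tokens = [
    [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)] for _ in range(T)
]

# Stand-ins for learned queries: one for the contrastive pooler,
# several for the generative pooler.
contrastive_query = [[rng.gauss(0, 1) for _ in range(D)]]
generative_queries = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(5)]

video_tokens = flatten_frames(frame_tokens)  # T*N tokens
video_embedding = attention_pool(contrastive_query, video_tokens)[0]
caption_memory = attention_pool(generative_queries, video_tokens)
```

Because the poolers only see a sequence of tokens, the same pooling code handles a single image (`T = 1`) and a multi-frame video without any dedicated cross-frame fusion module, which is the property the abstract highlights.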

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | YouCook2 | text-to-video R@1 | 21.7 | VideoCoCa (zero-shot) |
| Video | YouCook2 | text-to-video R@10 | 55.2 | VideoCoCa (zero-shot) |
| Video | YouCook2 | text-to-video R@5 | 43.9 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | text-to-video R@1 | 34.3 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | text-to-video R@10 | 67 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | text-to-video R@5 | 57.8 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | video-to-text R@1 | 64.7 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | video-to-text R@10 | 91.4 | VideoCoCa (zero-shot) |
| Video | MSR-VTT | video-to-text R@5 | 85.2 | VideoCoCa (zero-shot) |
| Visual Question Answering (VQA) | MSRVTT-QA | Accuracy | 0.463 | VideoCoCa |
| Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.569 | VideoCoCa |
| Video Question Answering | ActivityNet-QA | Accuracy | 56.1 | VideoCoCa |
| Video Question Answering | iVQA | Accuracy | 39 | VideoCoCa |
| Video Captioning | MSR-VTT | BLEU-4 | 53.8 | VideoCoCa |
| Video Captioning | MSR-VTT | CIDEr | 73.2 | VideoCoCa |
| Video Captioning | MSR-VTT | ROUGE-L | 68 | VideoCoCa |
| Video Captioning | VATEX | BLEU-4 | 39.7 | VideoCoCa |
| Video Captioning | VATEX | CIDEr | 77.8 | VideoCoCa |
| Video Captioning | VATEX | ROUGE-L | 54.5 | VideoCoCa |
| Video Captioning | YouCook2 | BLEU-4 | 14.2 | VideoCoCa |
| Video Captioning | YouCook2 | CIDEr | 1.28 | VideoCoCa |
| Video Captioning | YouCook2 | ROUGE-L | 37.7 | VideoCoCa |
| Video Captioning | ActivityNet Captions | BLEU-4 | 14.7 | VideoCoCa |
| Video Captioning | ActivityNet Captions | CIDEr | 39.3 | VideoCoCa |
| Video Captioning | ActivityNet Captions | ROUGE-L | 35 | VideoCoCa |
| Video Retrieval | YouCook2 | text-to-video R@1 | 21.7 | VideoCoCa (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 55.2 | VideoCoCa (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@5 | 43.9 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@1 | 34.3 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@10 | 67 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | text-to-video R@5 | 57.8 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@1 | 64.7 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@10 | 91.4 | VideoCoCa (zero-shot) |
| Video Retrieval | MSR-VTT | video-to-text R@5 | 85.2 | VideoCoCa (zero-shot) |
| Zero-Shot Action Recognition | UCF101 | Top-1 Accuracy | 86.6 | VideoCoCa |
| Zero-Shot Action Recognition | UCF101 | Top-5 Accuracy | 98.4 | VideoCoCa |
| Zero-Shot Action Recognition | Kinetics | Top-1 Accuracy | 70.1 | VideoCoCa |
| Zero-Shot Action Recognition | Kinetics | Top-5 Accuracy | 88.9 | VideoCoCa |
| Zero-Shot Action Recognition | Charades | mAP | 25.8 | VideoCoCa |
| Zero-Shot Action Recognition | HMDB51 | Top-1 Accuracy | 58.7 | VideoCoCa |
| Zero-Shot Action Recognition | HMDB51 | Top-5 Accuracy | 84.5 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@1 | 53.2 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@10 | 90.1 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | text-to-video R@5 | 83.3 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@1 | 73.6 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@10 | 97.2 | VideoCoCa |
| Zero-Shot Video Retrieval | VATEX | video-to-text R@5 | 93.2 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@1 | 34.3 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@10 | 67 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | text-to-video R@5 | 57.8 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@1 | 64.7 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@10 | 91.4 | VideoCoCa |
| Zero-Shot Video Retrieval | MSR-VTT-full | video-to-text R@5 | 85.2 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@1 | 34.5 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@10 | 76.6 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | text-to-video R@5 | 63.2 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@1 | 33 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@10 | 75.3 | VideoCoCa |
| Zero-Shot Video Retrieval | ActivityNet | video-to-text R@5 | 61.6 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 20.3 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 53.3 | VideoCoCa |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 43 | VideoCoCa |
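For readers unfamiliar with the R@K retrieval metrics reported above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; this page does not specify the evaluation protocol used for each benchmark, so that pairing, the function name `recall_at_k`, and the toy scores are illustrative assumptions only.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the ground truth = number of candidates scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy text-to-video similarity matrix (hypothetical scores, 3 queries x 3 videos).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked 1st
    [0.8, 0.5, 0.6],  # correct video ranked 3rd
    [0.2, 0.7, 0.4],  # correct video ranked 2nd
]

r_at_1 = recall_at_k(sim, 1)  # one of three queries hits
r_at_2 = recall_at_k(sim, 2)  # two of three queries hit
r_at_3 = recall_at_k(sim, 3)  # all queries hit
```

Under this convention, video-to-text recall is obtained by transposing the matrix so that videos act as queries. Datasets with several captions per video need a ground-truth mapping rather than the diagonal assumed here.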
