Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Date: 2021-09-28 | Venue: EMNLP 2021 11
Tasks: Action Segmentation, Video Retrieval, Action Localization, Zero-Shot Video Retrieval, Long Video Retrieval (Background Removed), Retrieval, Temporal Action Localization, Temporal Relation Extraction

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
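The contrastive objective described in the abstract can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `temperature` parameter, and the use of in-batch negatives only are assumptions made for illustration. VideoCLIP's sampling of temporally overlapping positive pairs and its retrieval-based hard negatives are not shown here.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax over one list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(video_embs, text_embs, temperature=1.0):
    """Sum of video-to-text and text-to-video InfoNCE losses.

    Assumption: video_embs[i] and text_embs[i] form the positive pair,
    and every other item in the batch acts as a negative.
    """
    n = len(video_embs)
    # Dot-product similarity between every video and every text.
    sim = [
        [sum(v * t for v, t in zip(video_embs[i], text_embs[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: each row is a softmax over candidate texts.
    v2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Text-to-video: transpose so each row is a softmax over candidate videos.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n
    return v2t + t2v


# Toy batch: two video embeddings and two text embeddings.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(videos, aligned_texts)
loss_shuffled = symmetric_info_nce(videos, shuffled_texts)
```

In this toy batch the correctly paired embeddings yield a lower loss than the shuffled ones, which is the behavior a contrastive objective is meant to reward.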

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | Vinoground | Group Score | 1.2 | VideoCLIP |
| Relation Extraction | Vinoground | Text Score | 17 | VideoCLIP |
| Relation Extraction | Vinoground | Video Score | 2.8 | VideoCLIP |
| Video | CrossTask | Recall | 47.3 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@1 | 30.9 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@10 | 66.8 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@5 | 55.4 | VideoCLIP |
| Video | YouCook2 | text-to-video R@1 | 32.2 | VideoCLIP |
| Video | YouCook2 | text-to-video R@10 | 75 | VideoCLIP |
| Video | YouCook2 | text-to-video R@5 | 62.6 | VideoCLIP |
| Video | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP (zero-shot) |
| Video | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP (zero-shot) |
| Video | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP (zero-shot) |
| Temporal Action Localization | CrossTask | Recall | 47.3 | VideoCLIP |
| Zero-Shot Learning | CrossTask | Recall | 47.3 | VideoCLIP |
| Action Localization | CrossTask | Recall | 47.3 | VideoCLIP |
| Action Localization | COIN | Frame accuracy | 68.7 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 30.9 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 66.8 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.4 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@1 | 32.2 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@10 | 75 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@5 | 62.6 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP (zero-shot) |
| Action Segmentation | COIN | Frame accuracy | 68.7 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Group Score | 1.2 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Text Score | 17 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Video Score | 2.8 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 74.5 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97.9 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 94.5 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 56 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 89.9 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 96.3 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 52.8 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 89.2 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 95 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 10.4 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 30 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 22.2 | VideoCLIP |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 16.6 | VideoCLIP |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 46.9 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP |
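
The text-to-video R@K figures above are recall-at-K scores: the percentage of text queries whose ground-truth video appears among the top K retrieved videos. The sketch below shows the standard definition only. It assumes one ground-truth video per query at the same index as the query, and it counts ties in the query's favor; the exact evaluation protocol of each benchmark (test split, tie handling) is not stated on this page, and the function name `recall_at_k` is hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item ranks in the top k.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank = number of videos scored strictly higher
        # than the ground-truth video.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix for three text queries and three videos.
toy_sim = [
    [0.9, 0.1, 0.2],  # ground truth ranked first
    [0.8, 0.3, 0.5],  # ground truth ranked third
    [0.2, 0.7, 0.6],  # ground truth ranked second
]

r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
r3 = recall_at_k(toy_sim, 3)
```

By construction R@K cannot decrease as K grows when all values come from the same ranking, which is a quick sanity check to apply when reading retrieval tables.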
