
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman

Published: 2021-04-01 · ICCV 2021
Tasks: Video Retrieval, Video-Text Retrieval, Zero-Shot Video Retrieval, Text Retrieval, Text to Video Retrieval, Video Captioning, Retrieval
Links: Paper · PDF · Code (official)

Abstract

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large-scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both of these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and TimeSformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image and video text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as 'frozen' snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset, WebVid-2M, comprising over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
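The abstract describes a dual-encoder joint embedding trained contrastively, with a curriculum that starts from single 'frozen' frames and gradually widens the temporal context. Below is a minimal sketch of one such training step, assuming generic video_encoder/text_encoder modules that return fixed-size embeddings and a symmetric InfoNCE-style loss; the paper's exact objective and interfaces may differ.

```python
import torch
import torch.nn.functional as F

def joint_embedding_step(video_encoder, text_encoder, frames, captions,
                         temperature=0.05):
    """One contrastive step for a text-video joint embedding.

    frames:   (batch, num_frames, 3, H, W); with num_frames == 1 a clip is a
              single 'frozen' image, which is where the curriculum begins;
              num_frames grows as training moves onto video data.
    captions: tokenized text batch, where caption i matches clip i.
    (Encoder interfaces and the loss form are assumptions, not the paper's code.)
    """
    v = F.normalize(video_encoder(frames), dim=-1)   # (batch, dim)
    t = F.normalize(text_encoder(captions), dim=-1)  # (batch, dim)
    logits = t @ v.T / temperature                   # text-to-video similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    # symmetric cross-entropy: matched (caption, clip) pairs are the positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

Because both modalities land in one embedding space, retrieval at test time reduces to a single matrix product between query-text embeddings and precomputed video embeddings, which is what makes the approach efficient for text-to-video search.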

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 31 | FROZEN
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 59.5 | FROZEN
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 70.5 | FROZEN
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 3 | FROZEN
Video Retrieval | MSR-VTT | text-to-video R@1 | 32.5 | FROZEN
Video Retrieval | MSR-VTT | text-to-video R@5 | 61.5 | FROZEN
Video Retrieval | MSR-VTT | text-to-video R@10 | 71.2 | FROZEN
Video Retrieval | DiDeMo | text-to-video R@1 | 31 | FROZEN
Video Retrieval | DiDeMo | text-to-video R@5 | 59.8 | FROZEN
Video Retrieval | DiDeMo | text-to-video R@10 | 72.4 | FROZEN
Video Retrieval | DiDeMo | text-to-video Median Rank | 3 | FROZEN
Video Retrieval | LSMDC | text-to-video R@1 | 15 | FROZEN
Video Retrieval | LSMDC | text-to-video R@5 | 30.8 | FROZEN
Video Retrieval | LSMDC | text-to-video R@10 | 39.8 | FROZEN
Video Retrieval | LSMDC | text-to-video Median Rank | 20 | FROZEN
Video Retrieval | MSVD | text-to-video R@1 | 33.7 | FROZEN
Video Retrieval | MSVD | text-to-video R@5 | 64.7 | FROZEN
Video Retrieval | MSVD | text-to-video R@10 | 76.3 | FROZEN
Video Retrieval | MSVD | text-to-video Median Rank | 3 | FROZEN
Video Retrieval | QuerYD | text-to-video R@1 | 53.8 | FROZEN
Video Retrieval | QuerYD | text-to-video R@5 | 75.7 | FROZEN
Video Retrieval | QuerYD | text-to-video R@10 | 82.7 | FROZEN
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 24.7 | FROZEN
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 46.9 | FROZEN
Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 57.2 | FROZEN
Zero-Shot Video Retrieval | MSR-VTT | text-to-video Median Rank | 7 | FROZEN
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 21.1 | FROZEN
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 46 | FROZEN
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 56.2 | FROZEN
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 20.2 | M. Bain et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 46.4 | M. Bain et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video R@10 | 58.5 | M. Bain et al.
Zero-Shot Video Retrieval | DiDeMo | text-to-video Median Rank | 7 | M. Bain et al.
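For reference, the metrics above follow the standard retrieval definitions: R@K is the percentage of text queries whose matched video ranks in the top K, and Median Rank is the median 1-indexed position of the matched video. A minimal sketch of how they are computed from a text-video similarity matrix (assuming one ground-truth video per caption; the function name and layout are illustrative, not the paper's evaluation code):

```python
import numpy as np

def text_to_video_metrics(sim):
    """Compute text-to-video R@K and Median Rank.

    sim: (num_texts, num_videos) similarity matrix where sim[i, i]
         scores the ground-truth (caption i, video i) pair.
    """
    order = np.argsort(-sim, axis=1)  # video indices, most similar first
    # 1-indexed rank of the ground-truth video for each caption
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(sim.shape[0])])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in (1, 5, 10)}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics
```

With the dual-encoder sketch above, this would be called as text_to_video_metrics(t @ v.T) on the L2-normalized caption and video embeddings of the test set.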

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)