Papers With Code


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

2019-06-07 · ICCV 2019
Tasks: Video Retrieval, Action Localization, Long Video Retrieval (Background Removed), Text to Video Retrieval, Retrieval

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
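Joint text-video embeddings of this kind are typically trained so that a matched caption and clip score higher than mismatched pairs under cosine similarity, via a bidirectional max-margin ranking loss. The sketch below is a minimal numpy illustration of that general recipe, not the paper's released implementation; the margin value, negative sampling, and exact formulation here are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def max_margin_ranking_loss(text_emb, video_emb, margin=0.1):
    """Bidirectional max-margin ranking loss for a batch of matched
    (caption, clip) embedding pairs; row i of each array is a positive pair.
    (Illustrative sketch; margin and batch-negative scheme are assumptions.)"""
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    sim = t @ v.T                      # (B, B) cosine similarities
    pos = np.diag(sim)                 # similarity of each matched pair
    # hinge penalty whenever a mismatched pair comes within `margin`
    # of the matched pair, in both retrieval directions
    cost_t2v = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v2t = np.maximum(0.0, margin + sim - pos[None, :])
    B = sim.shape[0]
    off_diag = ~np.eye(B, dtype=bool)
    return (cost_t2v[off_diag].sum() + cost_v2t[off_diag].sum()) / (B * (B - 1))
```

With perfectly aligned embeddings the loss is zero; shuffling the video rows against the text rows makes it positive, which is what drives matched pairs together during training.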

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Temporal Action Localization | CrossTask | Recall | 33.6 | Text-Video Embedding
Zero-Shot Learning | CrossTask | Recall | 33.6 | Text-Video Embedding
Action Localization | CrossTask | Recall | 33.6 | Text-Video Embedding
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 9 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 14.9 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 40.2 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 52.8 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 12 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 12.1 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 35 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 48 | HT
Video Retrieval | YouCook2 | text-to-video Median Rank | 24 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@1 | 8.2 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@5 | 24.5 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@10 | 35.3 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video Median Rank | 9 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video R@1 | 14.9 | Text-Video Embedding
Video Retrieval | MSR-VTT | video-to-text R@5 | 40.2 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video R@10 | 52.8 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video Median Rank | 40 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@1 | 7.2 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@5 | 19.6 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@10 | 27.9 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 46.6 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 74.3 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 83.7 | Text-Video Embedding
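The R@k and Median Rank numbers reported for retrieval follow the standard convention that caption i's correct match is clip i of the evaluation set: rank every clip by similarity to each caption, then report the fraction of captions whose correct clip lands in the top k and the median rank of the correct clip. A minimal numpy sketch of that computation (function and variable names are illustrative, not from the paper's released code):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video R@k (in percent) and Median Rank.
    sim[i, j] = similarity of caption i to clip j; ground truth is the
    diagonal, i.e. caption i's correct clip is clip i."""
    order = np.argsort(-sim, axis=1)   # clips sorted best-first per caption
    # position of the correct clip in each sorted row (1 = best rank)
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics
```

Higher R@k and lower Median Rank are better, which is why a Median Rank of 9 on MSR-VTT-1kA (HT-Pretrained) improves on the 12 of the non-pretrained HT model.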
