Papers With Code


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

2019-06-07 · ICCV 2019
Tasks: Video Retrieval, Action Localization, Long Video Retrieval (Background Removed), Text to Video Retrieval, Retrieval

Abstract

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
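Joint text-video embeddings of this kind are typically trained so that a matched caption and clip score higher than mismatched pairs under cosine similarity, via a bidirectional max-margin ranking loss. The sketch below is a minimal numpy illustration of that general recipe, not the paper's released implementation; the margin value, negative sampling, and exact formulation here are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def max_margin_ranking_loss(text_emb, video_emb, margin=0.1):
    """Bidirectional max-margin ranking loss for a batch of matched
    (caption, clip) embedding pairs; row i of each array is a positive pair.
    (Illustrative sketch; margin and batch-negative scheme are assumptions.)"""
    t = l2_normalize(text_emb)
    v = l2_normalize(video_emb)
    sim = t @ v.T                      # (B, B) cosine similarities
    pos = np.diag(sim)                 # similarity of each matched pair
    # hinge penalty whenever a mismatched pair comes within `margin`
    # of the matched pair, in both retrieval directions
    cost_t2v = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v2t = np.maximum(0.0, margin + sim - pos[None, :])
    B = sim.shape[0]
    off_diag = ~np.eye(B, dtype=bool)
    return (cost_t2v[off_diag].sum() + cost_v2t[off_diag].sum()) / (B * (B - 1))
```

With perfectly aligned embeddings the loss is zero; shuffling the video rows against the text rows makes it positive, which is what drives matched pairs together during training.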

Results

Task | Dataset | Metric | Value | Model
---- | ------- | ------ | ----- | -----
Temporal Action Localization | CrossTask | Recall | 33.6 | Text-Video Embedding
Zero-Shot Learning | CrossTask | Recall | 33.6 | Text-Video Embedding
Action Localization | CrossTask | Recall | 33.6 | Text-Video Embedding
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 9 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 14.9 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 40.2 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 52.8 | HT-Pretrained
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 12 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 12.1 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 35 | HT
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 48 | HT
Video Retrieval | YouCook2 | text-to-video Median Rank | 24 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@1 | 8.2 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@5 | 24.5 | Text-Video Embedding
Video Retrieval | YouCook2 | text-to-video R@10 | 35.3 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video Median Rank | 9 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video R@1 | 14.9 | Text-Video Embedding
Video Retrieval | MSR-VTT | video-to-text R@5 | 40.2 | Text-Video Embedding
Video Retrieval | MSR-VTT | text-to-video R@10 | 52.8 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video Median Rank | 40 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@1 | 7.2 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@5 | 19.6 | Text-Video Embedding
Video Retrieval | LSMDC | text-to-video R@10 | 27.9 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 46.6 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 74.3 | Text-Video Embedding
Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 83.7 | Text-Video Embedding
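The R@k and Median Rank numbers reported for retrieval follow the standard convention that caption i's correct match is clip i of the evaluation set: rank every clip by similarity to each caption, then report the fraction of captions whose correct clip lands in the top k and the median rank of the correct clip. A minimal numpy sketch of that computation (function and variable names are illustrative, not from the paper's released code):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Text-to-video R@k (in percent) and Median Rank.
    sim[i, j] = similarity of caption i to clip j; ground truth is the
    diagonal, i.e. caption i's correct clip is clip i."""
    order = np.argsort(-sim, axis=1)   # clips sorted best-first per caption
    # position of the correct clip in each sorted row (1 = best rank)
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics
```

Higher R@k and lower Median Rank are better, which is why a Median Rank of 9 on MSR-VTT-1kA (HT-Pretrained) improves on the 12 of the non-pretrained HT model.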
