Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

Published: 2019-12-13 · CVPR 2020
Tasks: Action Segmentation, Video Retrieval, Action Localization, Zero-Shot Video Retrieval, Long Video Retrieval (Background Removed), Text to Video Retrieval, Action Recognition, Retrieval
Links: Paper · PDF · Code (official) · Code

Abstract

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
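The MIL-NCE objective described in the abstract combines multiple-instance learning with noise-contrastive estimation: because narration and visual content are often misaligned, each video clip is paired with a *bag* of temporally close narrations (any of which may be the true match) rather than a single caption, and the loss contrasts the bag's total similarity against negatives from other videos. A minimal numpy sketch of this loss for a single clip (function name and shapes are illustrative, not from the paper's released code):

```python
import numpy as np

def mil_nce_loss(video_emb, pos_caption_embs, neg_caption_embs):
    """MIL-NCE loss for one video clip (illustrative sketch).

    video_emb:        (d,)   embedding of the video clip
    pos_caption_embs: (P, d) bag of temporally close narration embeddings
    neg_caption_embs: (N, d) narration embeddings from other videos
    """
    pos_scores = pos_caption_embs @ video_emb   # (P,) dot-product similarities
    neg_scores = neg_caption_embs @ video_emb   # (N,)
    all_scores = np.concatenate([pos_scores, neg_scores])
    # -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ),
    # computed with the usual log-sum-exp stabilization.
    m = all_scores.max()
    log_num = m + np.log(np.exp(pos_scores - m).sum())
    log_den = m + np.log(np.exp(all_scores - m).sum())
    return log_den - log_num
```

Summing all positive-pair scores inside the softmax, instead of forcing one designated positive, is what lets the model tolerate narrations that describe a neighboring moment in the video.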

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Activity Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Localization | COIN | Frame accuracy | 61 | MIL-NCE |
| Action Localization | COIN | Frame accuracy | 53.9 | CBT |
| Action Segmentation | COIN | Frame accuracy | 61 | MIL-NCE |
| Action Segmentation | COIN | Frame accuracy | 53.9 | CBT |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 43.1 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 68.6 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 79.1 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 9.9 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 24 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 32.4 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 29.5 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 15.1 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 38 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 51.2 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video Mean Rank | 10 | MIL-NCE |
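The retrieval rows above report Recall@K (fraction of text queries whose matching video appears in the top K results) and Mean Rank (average position of the correct video). A small numpy sketch of how these metrics are typically computed from a query-by-video similarity matrix (function and key names are illustrative, not from the paper's code):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@K and Mean Rank from a similarity matrix (illustrative sketch).

    sim[i, j] is the similarity between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])  # videos sorted by score, best first
        # 1-based rank at which the correct video appears
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.array(ranks)
    return {
        "R@1": float((ranks <= 1).mean()),
        "R@5": float((ranks <= 5).mean()),
        "R@10": float((ranks <= 10).mean()),
        "MeanRank": float(ranks.mean()),
    }
```

Higher R@K is better; lower Mean Rank is better, which is why MSR-VTT's Mean Rank of 29.5 indicates a harder benchmark than YouCook2's 10.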

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)