Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
Annotating videos is cumbersome, expensive and does not scale. Yet many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing the misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without any manual annotation. We evaluate our representations on four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
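As a rough illustration of how a multiple-instance contrastive objective can tolerate misaligned narrations, the sketch below scores one video clip against a bag of candidate narrations (for example, sentences close in time to the clip) and a set of negatives. It assumes the loss takes the form of a softmax-style ratio in which all candidate positives are summed in the numerator, so the clip only needs to match at least one of them well. The function names and the toy scores are hypothetical and are not taken from the authors' code.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_nce_loss(pos_scores, neg_scores):
    """Multiple-instance contrastive loss for one clip (illustrative sketch).

    pos_scores: similarity scores between the clip and each candidate
                positive narration (the "bag"); hypothetical inputs.
    neg_scores: similarity scores between the clip and negative narrations.
    Returns -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ).
    """
    log_pos = logsumexp(list(pos_scores))
    log_all = logsumexp(list(pos_scores) + list(neg_scores))
    return log_all - log_pos


# One well-aligned candidate in the bag is enough to make the loss small,
# even if another candidate in the same bag is poorly aligned.
loss_with_good_candidate = mil_nce_loss([5.0, -1.0], [0.0, 0.0])
loss_only_bad_candidate = mil_nce_loss([-1.0], [0.0, 0.0])
```

With a single positive per bag, this reduces to an ordinary softmax-style contrastive loss; the bag is what lets a loosely synchronized narration still count as supervision.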
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Segmentation | COIN | Frame accuracy | 61 | MIL-NCE |
| Action Segmentation | COIN | Frame accuracy | 53.9 | CBT |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 43.1 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 68.6 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 79.1 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 9.9 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 24 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 32.4 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 29.5 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 15.1 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 38 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 51.2 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video Mean Rank | 10 | MIL-NCE |
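
The R@K retrieval metrics in the table can be read as the percentage of text queries whose correct video appears among the top K retrieved results. The following is a minimal sketch under that assumption; the function name and the example ranks are hypothetical and do not reproduce any reported number.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked within the top k.

    ranks: 1-indexed rank of the correct video for each text query
           (hypothetical input; 1 means the correct video was retrieved first).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


# Toy example with five queries.
example_ranks = [1, 3, 7, 12, 50]
r_at_1 = recall_at_k(example_ranks, 1)
r_at_5 = recall_at_k(example_ranks, 5)
r_at_10 = recall_at_k(example_ranks, 10)
```

Higher R@K is better, and by construction R@1 can never exceed R@5, which can never exceed R@10.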