Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
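
The contrastive objective described above can be illustrated with a small sketch. This is not the MMPT implementation: it assumes a generic symmetric InfoNCE-style loss with in-batch negatives, and the names `symmetric_info_nce`, `nce_rows`, and the `temperature` parameter are hypothetical. The sampling of temporally overlapping clips and the retrieval of hard negatives are not shown; here row `i` of each input is simply assumed to be a positive pair.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nce_rows(scores):
    """Mean over rows i of -log softmax(scores[i])[i] (positives on the diagonal)."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Sum of video-to-text and text-to-video losses (assumed form, see lead-in)."""
    # scores[i][j] compares video i with text j; the other entries in a row
    # act as in-batch negatives for the positive on the diagonal.
    scores = [[dot(v, t) / temperature for t in text_emb] for v in video_emb]
    scores_t = [list(col) for col in zip(*scores)]  # text-to-video direction
    return nce_rows(scores) + nce_rows(scores_t)


# Toy check: correctly paired embeddings should score a lower loss than shuffled ones.
aligned = symmetric_info_nce([[5.0, 0.0], [0.0, 5.0]], [[5.0, 0.0], [0.0, 5.0]])
shuffled = symmetric_info_nce([[5.0, 0.0], [0.0, 5.0]], [[0.0, 5.0], [5.0, 0.0]])
print(aligned < shuffled)  # True
```

In a sketch like this, harder negatives correspond to off-diagonal scores that sit close to the diagonal ones, which is what makes the loss informative late in training.
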
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | Vinoground | Group Score | 1.2 | VideoCLIP |
| Relation Extraction | Vinoground | Text Score | 17 | VideoCLIP |
| Relation Extraction | Vinoground | Video Score | 2.8 | VideoCLIP |
| Video | CrossTask | Recall | 47.3 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@1 | 30.9 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@10 | 66.8 | VideoCLIP |
| Video | MSR-VTT-1kA | text-to-video R@5 | 55.4 | VideoCLIP |
| Video | YouCook2 | text-to-video R@1 | 32.2 | VideoCLIP |
| Video | YouCook2 | text-to-video R@10 | 75 | VideoCLIP |
| Video | YouCook2 | text-to-video R@5 | 62.6 | VideoCLIP |
| Video | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP (zero-shot) |
| Video | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP (zero-shot) |
| Video | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP (zero-shot) |
| Temporal Action Localization | CrossTask | Recall | 47.3 | VideoCLIP |
| Zero-Shot Learning | CrossTask | Recall | 47.3 | VideoCLIP |
| Action Localization | CrossTask | Recall | 47.3 | VideoCLIP |
| Action Localization | COIN | Frame accuracy | 68.7 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 30.9 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 66.8 | VideoCLIP |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 55.4 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@1 | 32.2 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@10 | 75 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@5 | 62.6 | VideoCLIP |
| Video Retrieval | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP (zero-shot) |
| Video Retrieval | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP (zero-shot) |
| Action Segmentation | COIN | Frame accuracy | 68.7 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Group Score | 1.2 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Text Score | 17 | VideoCLIP |
| Temporal Relation Extraction | Vinoground | Video Score | 2.8 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 74.5 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 97.9 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 94.5 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@1 | 56 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@10 | 96.3 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | DTW R@5 | 89.9 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@1 | 52.8 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@10 | 95 | VideoCLIP |
| Long Video Retrieval (Background Removed) | YouCook2 | OTAM R@5 | 89.2 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 10.4 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 30 | VideoCLIP |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 22.2 | VideoCLIP |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@1 | 16.6 | VideoCLIP |
| Zero-Shot Video Retrieval | DiDeMo | text-to-video R@5 | 46.9 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 22.7 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 63.1 | VideoCLIP |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 50.4 | VideoCLIP |
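
For readers unfamiliar with the retrieval metrics in the table, the sketch below shows one common way to compute text-to-video Recall@K (R@K) from a similarity matrix. It assumes that text query `i` has exactly one ground-truth video at index `i`; the function name `recall_at_k` and the toy scores are hypothetical, and the evaluation protocols used for the numbers above may differ in details such as tie handling.

```python
def recall_at_k(sim, k):
    """Percentage of text queries whose ground-truth video ranks in the top k.

    sim[i][j] is the similarity between text query i and video j;
    the ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many videos score strictly higher than the match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix: 3 text queries (rows) against 3 videos (columns).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.8, 0.5, 0.2],  # query 1: correct video ranked second
    [0.1, 0.7, 0.6],  # query 2: correct video ranked second
]
print(round(recall_at_k(sim, 1), 1))  # 33.3
print(recall_at_k(sim, 2))            # 100.0
```

Because the top-5 results are always a subset of the top-10 results, R@K computed this way is non-decreasing in K for a fixed model and dataset.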