Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

Published: 2019-12-13 · CVPR 2020
Tasks: Action Segmentation, Video Retrieval, Action Localization, Zero-Shot Video Retrieval, Long Video Retrieval (Background Removed), Text to Video Retrieval, Action Recognition, Retrieval
Links: Paper · PDF · Code (official) · Code

Abstract

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
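The MIL-NCE objective described in the abstract combines multiple-instance learning with noise-contrastive estimation: because narration and visual content are often misaligned, each video clip is paired with a *bag* of temporally close narrations (any of which may be the true match) rather than a single caption, and the loss contrasts the bag's total similarity against negatives from other videos. A minimal numpy sketch of this loss for a single clip (function name and shapes are illustrative, not from the paper's released code):

```python
import numpy as np

def mil_nce_loss(video_emb, pos_caption_embs, neg_caption_embs):
    """MIL-NCE loss for one video clip (illustrative sketch).

    video_emb:        (d,)   embedding of the video clip
    pos_caption_embs: (P, d) bag of temporally close narration embeddings
    neg_caption_embs: (N, d) narration embeddings from other videos
    """
    pos_scores = pos_caption_embs @ video_emb   # (P,) dot-product similarities
    neg_scores = neg_caption_embs @ video_emb   # (N,)
    all_scores = np.concatenate([pos_scores, neg_scores])
    # -log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ),
    # computed with the usual log-sum-exp stabilization.
    m = all_scores.max()
    log_num = m + np.log(np.exp(pos_scores - m).sum())
    log_den = m + np.log(np.exp(all_scores - m).sum())
    return log_den - log_num
```

Summing all positive-pair scores inside the softmax, instead of forcing one designated positive, is what lets the model tolerate narrations that describe a neighboring moment in the video.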

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Activity Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Recognition | RareAct | mWAP | 30.5 | HT100M S3D |
| Action Localization | COIN | Frame accuracy | 61 | MIL-NCE |
| Action Localization | COIN | Frame accuracy | 53.9 | CBT |
| Action Segmentation | COIN | Frame accuracy | 61 | MIL-NCE |
| Action Segmentation | COIN | Frame accuracy | 53.9 | CBT |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@1 | 43.1 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@5 | 68.6 | MIL-NCE |
| Long Video Retrieval (Background Removed) | YouCook2 | Cap. Avg. R@10 | 79.1 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@1 | 9.9 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@5 | 24 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video R@10 | 32.4 | MIL-NCE |
| Zero-Shot Video Retrieval | MSR-VTT | text-to-video Mean Rank | 29.5 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@1 | 15.1 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@5 | 38 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video R@10 | 51.2 | MIL-NCE |
| Zero-Shot Video Retrieval | YouCook2 | text-to-video Mean Rank | 10 | MIL-NCE |
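The retrieval rows above report Recall@K (fraction of text queries whose matching video appears in the top K results) and Mean Rank (average position of the correct video). A small numpy sketch of how these metrics are typically computed from a query-by-video similarity matrix (function and key names are illustrative, not from the paper's code):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@K and Mean Rank from a similarity matrix (illustrative sketch).

    sim[i, j] is the similarity between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])  # videos sorted by score, best first
        # 1-based rank at which the correct video appears
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.array(ranks)
    return {
        "R@1": float((ranks <= 1).mean()),
        "R@5": float((ranks <= 5).mean()),
        "R@10": float((ranks <= 10).mean()),
        "MeanRank": float(ranks.mean()),
    }
```

Higher R@K is better; lower Mean Rank is better, which is why MSR-VTT's Mean Rank of 29.5 indicates a harder benchmark than YouCook2's 10.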

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
- A Survey of Context Engineering for Large Language Models (2025-07-17)
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
- Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
- Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)