Unsupervised Learning from Narrated Instruction Videos

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

2015-06-30 | CVPR 2016 | Clustering

Abstract

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. First, we develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method solves two clustering problems, one in text and one in video, applied one after the other and linked by joint constraints to obtain a single coherent sequence of steps in both modalities. Second, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains about 800,000 frames for five different tasks that include complex interactions between people and objects, and are captured in a variety of indoor and outdoor settings. Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.
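
The abstract does not spell out how the two clustering problems are coupled, so the following is only a toy illustration of the general idea of a "single coherent sequence of steps": once text clustering has produced an ordered list of K steps, that order can constrain where each step is placed in a video. The sketch assumes a hypothetical per-frame score matrix (how well each frame matches each step) and an optional hypothetical mask of frames permitted by the narration timing. It uses a simple dynamic program, which is a stand-in and not the optimization used in the paper.

```python
import numpy as np

def ordered_assignment(scores, allowed=None):
    """Pick one frame per step, in strictly increasing time order,
    maximizing the summed score.

    scores  : (T, K) array, scores[t, k] = hypothetical match of frame t to step k
    allowed : optional (T, K) boolean mask; False forbids placing step k at frame t
              (an assumed stand-in for constraints derived from the narration)
    Returns (frames, total) where frames[k] is the frame chosen for step k.
    """
    T, K = scores.shape
    s = scores.astype(float).copy()
    if allowed is not None:
        s[~allowed] = -np.inf

    # best[t, k]: best total for steps 0..k, with step k placed at a frame <= t
    best = np.full((T, K), -np.inf)
    # take[t, k]: True if the optimum for best[t, k] places step k exactly at t
    take = np.zeros((T, K), dtype=bool)

    for k in range(K):
        for t in range(T):
            if k == 0:
                place = s[t, 0]
            elif t > 0:
                place = best[t - 1, k - 1] + s[t, k]
            else:
                place = -np.inf  # step k > 0 cannot sit at the first frame
            skip = best[t - 1, k] if t > 0 else -np.inf
            if place > skip:
                best[t, k] = place
                take[t, k] = True
            else:
                best[t, k] = skip

    if not np.isfinite(best[T - 1, K - 1]):
        raise ValueError("no assignment satisfies the constraints")

    # Backtrack from the last step to recover the chosen frames.
    frames = [0] * K
    t = T - 1
    for k in range(K - 1, -1, -1):
        while not take[t, k]:
            t -= 1
        frames[k] = t
        t -= 1
    return frames, float(best[T - 1, K - 1])

# Toy example: 5 frames, 2 steps. Taken independently, step 1 scores highest
# at frame 0 and step 0 at frame 1, which would violate the step order.
toy_scores = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.3, 0.1],
    [0.2, 0.7],
    [0.1, 0.3],
])
frames, total = ordered_assignment(toy_scores)
print(frames, total)  # frames == [1, 3]
```

In the toy example the ordering constraint moves step 1 from its individually best frame (frame 0) to frame 3, so that it follows step 0. Passing an `allowed` mask narrows the choice further, for instance to frames near the time a step is mentioned in the narration.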

Results

Task	Dataset	Metric	Value	Model
Video	CrossTask	Recall	13.3	Alayrac
Temporal Action Localization	CrossTask	Recall	13.3	Alayrac
Zero-Shot Learning	CrossTask	Recall	13.3	Alayrac
Action Localization	CrossTask	Recall	13.3	Alayrac
