Learning from Video and Text via Large-Scale Discriminative Clustering

Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, Josef Sivic

2017-07-27
ICCV 2017 10
Video Retrieval, Clustering, Video Alignment, Action Recognition, Temporal Action Localization

Abstract

Discriminative clustering has been successfully applied to a number of weakly supervised learning tasks. Such applications include person and action recognition, text-to-video alignment, and object co-segmentation and co-localization in videos and images. One drawback of discriminative clustering, however, is its limited scalability. We address this issue and propose an online optimization algorithm based on the Block-Coordinate Frank-Wolfe algorithm. We apply the proposed method to the problem of weakly supervised learning of actions and actors from movies together with corresponding movie scripts. Scaling the learning problem up to 66 feature-length movies enables us to significantly improve weakly supervised action recognition.
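The abstract names Block-Coordinate Frank-Wolfe (BCFW) as the basis of the proposed online algorithm. The sketch below is a generic toy illustration of BCFW and should not be read as the authors' implementation. It assumes a ridge-regression discriminative clustering cost 0.5 * tr(Y^T B Y), treats each row of the assignment matrix Y as one block constrained to the probability simplex, and fixes a few rows as stand-ins for weak labels. The function name `bcfw_toy`, the random data, and all parameter values are hypothetical.

```python
import numpy as np

def bcfw_toy(B, Y0, free_rows, n_steps=500, seed=0):
    """Toy block-coordinate Frank-Wolfe for min 0.5 * tr(Y^T B Y).

    Every row of Y lives on the probability simplex. One block is one
    row of Y, which is an assumption made here for brevity.
    """
    rng = np.random.default_rng(seed)
    Y = Y0.copy()
    history = [0.5 * np.trace(Y.T @ B @ Y)]
    for _ in range(n_steps):
        i = rng.choice(free_rows)      # sample one block uniformly
        g = B[i] @ Y                   # gradient w.r.t. row i (B is symmetric)
        s = np.zeros_like(g)
        s[np.argmin(g)] = 1.0          # linear minimization oracle on the simplex
        d = s - Y[i]                   # Frank-Wolfe direction for this block
        curv = B[i, i] * (d @ d)       # curvature of the objective along d
        if curv > 1e-12:
            # Exact line search for a quadratic, clipped to stay feasible.
            gamma = np.clip(-(g @ d) / curv, 0.0, 1.0)
            Y[i] += gamma * d
        history.append(0.5 * np.trace(Y.T @ B @ Y))
    return Y, history

# Hypothetical data: N samples, D features, K classes, ridge parameter lam.
rng = np.random.default_rng(0)
N, D, K, lam = 40, 5, 3, 0.1
X = rng.normal(size=(N, D))
# Quadratic cost matrix obtained by eliminating the ridge-regression weights.
B = (np.eye(N) - X @ np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T)) / N

Y0 = np.full((N, K), 1.0 / K)               # uniform soft assignments
Y0[:6] = np.eye(K)[[0, 1, 2, 0, 1, 2]]      # hypothetical weak labels, kept fixed
Y, history = bcfw_toy(B, Y0, free_rows=np.arange(6, N))
```

Each step touches a single block and needs only one row of B, which is what makes block-coordinate updates attractive when the full problem is too large to process at once. With exact line search on a convex quadratic, the objective in `history` never increases.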

Results

Task	Dataset	Metric	Value	Model
Video	LSMDC	text-to-video Median Rank	52	Large-Scale Discriminative Clustering
Video	LSMDC	text-to-video R@1	7.3	Large-Scale Discriminative Clustering
Video	LSMDC	text-to-video R@10	27.1	Large-Scale Discriminative Clustering
Video	LSMDC	text-to-video R@5	19.2	Large-Scale Discriminative Clustering
Video Retrieval	LSMDC	text-to-video Median Rank	52	Large-Scale Discriminative Clustering
Video Retrieval	LSMDC	text-to-video R@1	7.3	Large-Scale Discriminative Clustering
Video Retrieval	LSMDC	text-to-video R@10	27.1	Large-Scale Discriminative Clustering
Video Retrieval	LSMDC	text-to-video R@5	19.2	Large-Scale Discriminative Clustering
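For readers unfamiliar with the metrics in the table, the following sketch shows how text-to-video R@K and Median Rank are commonly computed from a query-by-video similarity matrix. It assumes the usual convention that query i has exactly one correct video, also at index i. The function name `retrieval_metrics` and the example scores are hypothetical; the paper's own evaluation code may differ.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i, j] is the score of video j for text query i.

    Assumes the correct video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    correct = sim[np.arange(n), np.arange(n)]
    # Rank of the correct video: 1 + number of videos scored strictly higher.
    ranks = 1 + (sim > correct[:, None]).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Median Rank"] = float(np.median(ranks))
    return metrics

# Hypothetical scores for three queries; the correct videos rank 1, 3 and 1.
example_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
])
example_metrics = retrieval_metrics(example_sim, ks=(1, 2))
```

Higher R@K is better, while a lower Median Rank is better.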
