Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

Published: 2019-11-28 · NeurIPS 2020
Tasks: Deep Clustering, Representation Learning, Audio Classification, Self-Supervised Learning, Self-Supervised Audio Classification, Clustering, Action Recognition, Self-Supervised Action Recognition
Links: Paper · PDF · Code (official)

Abstract

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
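The core loop described in the abstract — cluster features from one modality, then use the cluster assignments as classification targets for the other modality's encoder — can be illustrated with a minimal NumPy sketch. This is not the paper's actual pipeline: the random feature matrices stand in for real audio/video encoder outputs, the k-means and the single linear-head gradient step are simplified assumptions for illustration only.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain k-means returning a cluster id per row (simplified stand-in)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # squared distances to each center, then nearest-center assignment
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return labels

def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(200, 16))  # stand-in for audio encoder outputs
video_feats = rng.normal(size=(200, 16))  # stand-in for video encoder outputs

k = 8
# Cross-modal supervision: each modality is labeled by the OTHER modality's clusters.
video_targets = kmeans(audio_feats, k)  # audio clusters supervise the video model
audio_targets = kmeans(video_feats, k)  # video clusters supervise the audio model

# One gradient step of a linear video head on the audio-derived pseudo-labels
# (softmax cross-entropy; a real run would alternate clustering and training).
W = np.zeros((16, k))
onehot = np.eye(k)[video_targets]
probs = softmax(video_feats @ W)
W -= 0.1 * video_feats.T @ (probs - onehot) / len(video_feats)
```

In the full method this alternation repeats: as each encoder improves, its features are re-clustered to produce fresher pseudo-labels for the other modality.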

Results

Task                  Dataset              Metric           Value  Model
--------------------  -------------------  ---------------  -----  -----
Action Recognition    UCF101 (finetuned)   3-fold Accuracy  95.5   XDC
Action Recognition    HMDB51               Top-1 Accuracy   68.9   XDC
Action Recognition    HMDB51               Top-1 Accuracy   66.5   XDC
Action Recognition    HMDB51               Top-1 Accuracy   63.7   XDC
Action Recognition    HMDB51               Top-1 Accuracy   52.6   XDC
Action Recognition    HMDB51 (finetuned)   Top-1 Accuracy   68.9   XDC
Audio Classification  ESC-50               Top-1 Accuracy   85.4   XDC
Audio Classification  ESC-50               Top-1 Accuracy   84.8   XDC
Audio Classification  DCASE                Top-1 Accuracy   95     XDC

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)