Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran

Published: 2019-11-28 · NeurIPS 2020
Tasks: Deep Clustering, Representation Learning, Audio Classification, Self-Supervised Learning, Self-Supervised Audio Classification, Clustering, Action Recognition, Self-Supervised Action Recognition
Links: Paper · PDF · Code (official)

Abstract

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
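The core loop described in the abstract — cluster features from one modality, then use the cluster assignments as classification targets for the other modality's encoder — can be illustrated with a minimal NumPy sketch. This is not the paper's actual pipeline: the random feature matrices stand in for real audio/video encoder outputs, the k-means and the single linear-head gradient step are simplified assumptions for illustration only.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain k-means returning a cluster id per row (simplified stand-in)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # squared distances to each center, then nearest-center assignment
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return labels

def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(200, 16))  # stand-in for audio encoder outputs
video_feats = rng.normal(size=(200, 16))  # stand-in for video encoder outputs

k = 8
# Cross-modal supervision: each modality is labeled by the OTHER modality's clusters.
video_targets = kmeans(audio_feats, k)  # audio clusters supervise the video model
audio_targets = kmeans(video_feats, k)  # video clusters supervise the audio model

# One gradient step of a linear video head on the audio-derived pseudo-labels
# (softmax cross-entropy; a real run would alternate clustering and training).
W = np.zeros((16, k))
onehot = np.eye(k)[video_targets]
probs = softmax(video_feats @ W)
W -= 0.1 * video_feats.T @ (probs - onehot) / len(video_feats)
```

In the full method this alternation repeats: as each encoder improves, its features are re-clustered to produce fresher pseudo-labels for the other modality.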

Results

Task                  Dataset              Metric           Value  Model
--------------------  -------------------  ---------------  -----  -----
Action Recognition    UCF101 (finetuned)   3-fold Accuracy  95.5   XDC
Action Recognition    HMDB51               Top-1 Accuracy   68.9   XDC
Action Recognition    HMDB51               Top-1 Accuracy   66.5   XDC
Action Recognition    HMDB51               Top-1 Accuracy   63.7   XDC
Action Recognition    HMDB51               Top-1 Accuracy   52.6   XDC
Action Recognition    HMDB51 (finetuned)   Top-1 Accuracy   68.9   XDC
Audio Classification  ESC-50               Top-1 Accuracy   85.4   XDC
Audio Classification  ESC-50               Top-1 Accuracy   84.8   XDC
Audio Classification  DCASE                Top-1 Accuracy   95     XDC

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Tri-Learn Graph Fusion Network for Attributed Graph Clustering (2025-07-18)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)