Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

Pritam Sarkar, Ali Etemad

Published: 2021-11-09
Tasks: Sound Classification · Audio Classification · Self-Supervised Learning · Self-Supervised Audio Classification · Retrieval · Self-Supervised Video Retrieval · Self-Supervised Action Recognition
Links: Paper · PDF · Code (official)

Abstract

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion: in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying size: Kinetics-Sound, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or achieves performance on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
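The abstract describes three kinds of relations being learned: intra-modal, synchronous cross-modal, and asynchronous cross-modal. As a minimal sketch only, the idea can be illustrated with cosine-agreement terms over batches of embeddings; the function names, the plain "1 − cosine" terms, and the equal weighting below are illustrative assumptions, not the authors' actual training objective.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between two batches of embeddings.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def crisscross_style_loss(v1, v2, a_sync, a_async):
    """Toy combination of the three relations named in the abstract
    (illustrative only, not the paper's exact loss):
      - intra-modal: two views of the video clip (v1 vs v2),
      - synchronous cross-modal: video vs temporally aligned audio,
      - asynchronous cross-modal: video vs temporally shifted audio.
    Each term pulls the corresponding pair of embeddings together.
    """
    intra = 1.0 - cosine(v1, v2).mean()
    sync = 1.0 - cosine(v1, a_sync).mean()
    asyn = 1.0 - cosine(v1, a_async).mean()
    return intra + sync + asyn
```

Relaxing synchronicity is captured by the third term: the asynchronous audio embedding is treated as a valid positive pair rather than being pushed away, which is the departure from purely synchronous audio-visual objectives that the abstract emphasizes.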

Results

Task | Dataset | Metric | Value | Model
Action Recognition | UCF101 | 3-fold Accuracy | 92.4 | CrissCross (AudioSet)
Action Recognition | UCF101 | 3-fold Accuracy | 91.5 | CrissCross (Kinetics-400)
Action Recognition | UCF101 | 3-fold Accuracy | 88.3 | CrissCross (Kinetics-Sound)
Action Recognition | HMDB51 | Top-1 Accuracy | 66.8 | CrissCross (AudioSet)
Action Recognition | HMDB51 | Top-1 Accuracy | 64.7 | CrissCross (Kinetics-400)
Action Recognition | HMDB51 | Top-1 Accuracy | 60.5 | CrissCross (Kinetics-Sound)
Audio Classification | DCASE | Top-1 Accuracy | 97 | CrissCross (AudioSet)
Audio Classification | DCASE | Top-1 Accuracy | 96 | CrissCross (Kinetics-400)
Audio Classification | DCASE | Top-1 Accuracy | 93 | CrissCross (Kinetics-Sound)
