Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Bruno Korbar, Du Tran, Lorenzo Torresani

2018-06-30 | NeurIPS 2018 12 | Audio Classification, Self-Supervised Audio Classification, Action Recognition, Temporal Action Localization, Self-Supervised Action Recognition

Abstract

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
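The abstract names a contrastive loss over audio-video pairs but does not give its form. As a rough illustration only, the sketch below assumes the standard margin-based contrastive loss: in-sync pairs are pulled together by their squared distance, and out-of-sync pairs are pushed apart until they are at least a margin away. The function names, the `margin` parameter, and the list-of-lists embedding format are all hypothetical and are not taken from the paper.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(video_emb, audio_emb, in_sync, margin=1.0):
    """Mean margin-based contrastive loss over a batch of audio-video pairs.

    Assumed form (not confirmed by the abstract):
      in-sync pair (label 1):      d^2
      out-of-sync pair (label 0):  max(margin - d, 0)^2
    where d is the distance between the video and audio embeddings.
    """
    if not in_sync:
        raise ValueError("batch must contain at least one pair")
    total = 0.0
    for v, a, y in zip(video_emb, audio_emb, in_sync):
        d = euclidean_distance(v, a)
        if y:
            # Positive pair: penalize any distance between the embeddings.
            total += d ** 2
        else:
            # Negative pair: penalize only if closer than the margin.
            total += max(margin - d, 0.0) ** 2
    return total / len(in_sync)
```

For example, `contrastive_loss([[0.0, 0.0]], [[3.0, 4.0]], [1])` treats the pair as in sync and penalizes its distance of 5, while the same pair labeled `0` with `margin=1.0` incurs no loss because the embeddings are already farther apart than the margin.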

Results

Task | Dataset | Metric | Value | Model
Activity Recognition | UCF101 (finetuned) | 3-fold Accuracy | 89 | AVTS
Activity Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 61.6 | AVTS
Audio Classification | ESC-50 | Top-1 Accuracy | 82.3 | AVTS
Action Recognition | UCF101 (finetuned) | 3-fold Accuracy | 89 | AVTS
Action Recognition | HMDB51 (finetuned) | Top-1 Accuracy | 61.6 | AVTS
Classification | ESC-50 | Top-1 Accuracy | 82.3 | AVTS
