Look, Listen and Learn

Relja Arandjelović, Andrew Zisserman

2017-05-23 | ICCV 2017 10 | Sound Classification | Audio Classification | General Classification

Abstract

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
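The correspondence task described in the abstract can be illustrated with a small sketch of how training pairs might be formed. This is only an assumption-laden illustration: the abstract does not specify the sampling scheme, so the rule used here (a positive pair takes a frame and an audio clip from the same video at the same time step, a negative pair takes the audio from a different video) and the names `make_avc_pairs`, `num_videos`, and `frames_per_video` are hypothetical.

```python
import random


def make_avc_pairs(num_videos, frames_per_video, rng):
    """Build (frame_ref, audio_ref, label) triples for a correspondence task.

    Each reference is a (video_id, time_index) tuple. Label 1 means the frame
    and audio come from the same video and time; label 0 means the audio was
    taken from a different video. Assumes num_videos >= 2.
    """
    pairs = []
    for vid in range(num_videos):
        t = rng.randrange(frames_per_video)
        # Positive pair: frame and audio share video and time index.
        pairs.append(((vid, t), (vid, t), 1))
        # Negative pair: audio is drawn from some other video (assumed rule).
        other = rng.choice([v for v in range(num_videos) if v != vid])
        pairs.append(((vid, t), (other, rng.randrange(frames_per_video)), 0))
    return pairs


pairs = make_avc_pairs(num_videos=4, frames_per_video=10, rng=random.Random(0))
for frame_ref, audio_ref, label in pairs:
    print(frame_ref, audio_ref, label)
```

A binary classifier trained on such pairs would receive one positive and one negative example per video, so the two classes stay balanced without any manual labels.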

Results

Task                  Dataset   Metric          Value  Model
Audio Classification  ESC-50    Top-1 Accuracy  79.3   L3
Audio Classification  AudioSet  Test mAP        0.249  L3
Classification        ESC-50    Top-1 Accuracy  79.3   L3
Classification        AudioSet  Test mAP        0.249  L3
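
The two metrics reported above can be sketched with their standard textbook definitions. This page does not describe the exact evaluation protocol behind the reported numbers, so the functions below are a generic illustration with hypothetical names and toy inputs, not the procedure used to obtain 79.3 or 0.249. Note that the accuracy is expressed on a 0-100 scale and the mAP on a 0-1 scale, matching the way the values are listed.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)


def average_precision(scores, targets):
    """Mean of precision-at-rank over the ranks where a positive occurs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, target_matrix):
    """Average the per-class average precision over all classes."""
    num_classes = len(score_matrix[0])
    aps = [
        average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in target_matrix],
        )
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy inputs, purely illustrative.
acc = top1_accuracy([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]], [1, 0, 0])
ap = average_precision([0.9, 0.8, 0.1], [1, 0, 1])
print(round(acc, 2), round(ap, 4))
```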

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
USAD: Universal Speech and Audio Representation via Distillation (2025-06-23)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment (2025-06-17)
Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound Classification (2025-06-12)
MUDAS: Mote-scale Unsupervised Domain Adaptation in Multi-label Sound Classification (2025-06-12)