Look, Listen and Learn

Relja Arandjelović, Andrew Zisserman

2017-05-23 | ICCV 2017 10 | Sound Classification, Audio Classification, General Classification

Abstract

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
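The correspondence task described in the abstract can be pictured as a binary labelling problem over (video frame, audio clip) pairs. The sketch below is only an illustration under stated assumptions: the abstract does not say how pairs are sampled, so the 50/50 balance, the choice to draw non-corresponding audio from a different video, and the names `make_avc_pairs` and `video_ids` are all hypothetical. Identifiers stand in for the actual frames and audio.

```python
import random
from typing import List, Tuple


def make_avc_pairs(
    video_ids: List[str], n_pairs: int, seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Build (frame_video, audio_video, label) triples.

    label 1: frame and audio come from the same video (corresponding).
    label 0: audio comes from a different video (assumed negative scheme).
    """
    if len(set(video_ids)) < 2:
        raise ValueError("need at least two distinct videos to form negatives")

    rng = random.Random(seed)
    pairs = []
    for i in range(n_pairs):
        frame_video = rng.choice(video_ids)
        if i % 2 == 0:
            # Positive pair: audio taken from the same video as the frame.
            audio_video, label = frame_video, 1
        else:
            # Negative pair: audio taken from any other video (assumption).
            others = [v for v in video_ids if v != frame_video]
            audio_video, label = rng.choice(others), 0
        pairs.append((frame_video, audio_video, label))
    return pairs


if __name__ == "__main__":
    for frame_video, audio_video, label in make_avc_pairs(["v1", "v2", "v3"], 4):
        print(frame_video, audio_video, label)
```

No manual annotation is needed to produce these labels: they follow entirely from which video each stream was taken from, which is what lets the task run on raw unlabelled videos.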

Results

Task	Dataset	Metric	Value	Model
Audio Classification	ESC-50	Top-1 Accuracy	79.3	L3
Audio Classification	AudioSet	Test mAP	0.249	L3
Classification	ESC-50	Top-1 Accuracy	79.3	L3
Classification	AudioSet	Test mAP	0.249	L3

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
USAD: Universal Speech and Audio Representation via Distillation (2025-06-23)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment (2025-06-17)
Disentangling Dual-Encoder Masked Autoencoder for Respiratory Sound Classification (2025-06-12)
MUDAS: Mote-scale Unsupervised Domain Adaptation in Multi-label Sound Classification (2025-06-12)