Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

UAVM: Towards Unifying Audio and Visual Models

Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

2022-07-29 | Multi-modal Classification | Audio Classification | audio-visual learning

Abstract

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Results

Task                        Dataset    Metric          Value  Model
--------------------------  ---------  --------------  -----  --------------------
Audio Classification        AudioSet   Test mAP        0.504  UAVM (Audio + Video)
Audio Classification        VGGSound   Top 1 Accuracy  65.8   UAVM (Audio + Video)
Audio Classification        VGGSound   Top 1 Accuracy  56.5   UAVM (Audio Only)
Audio Classification        VGGSound   Top 1 Accuracy  49.9   UAVM (Video Only)
Classification              AudioSet   Test mAP        0.504  UAVM (Audio + Video)
Classification              VGGSound   Top 1 Accuracy  65.8   UAVM (Audio + Video)
Classification              VGGSound   Top 1 Accuracy  56.5   UAVM (Audio Only)
Classification              VGGSound   Top 1 Accuracy  49.9   UAVM (Video Only)
Classification              VGG-Sound  Top-1 Accuracy  65.8   UAVM
Classification              AudioSet   Average mAP     0.504  UAVM
Multi-modal Classification  VGG-Sound  Top-1 Accuracy  65.8   UAVM
Multi-modal Classification  AudioSet   Average mAP     0.504  UAVM
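
The results above use two metrics: Top-1 accuracy (reported for VGGSound) and mean average precision, mAP (reported for AudioSet). As a rough illustration of how such numbers are typically computed, here is a minimal sketch in plain Python. The function names and the toy scores are hypothetical and are not taken from the paper or from the benchmark's evaluation code, which may differ in details such as tie handling or classes without positive examples.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Percentage of clips whose highest-scoring class equals the single true label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


def average_precision(scores: Sequence[float], targets: Sequence[int]) -> float:
    """AP for one class: mean of the precision measured at each positive's rank."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, target) in enumerate(ranked, start=1):
        if target == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positives contributes 0.0 (real evaluators often skip it).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    scores: Sequence[Sequence[float]], targets: Sequence[Sequence[int]]
) -> float:
    """Mean of per-class AP for multi-label data (rows are clips, columns are classes)."""
    num_classes = len(scores[0])
    per_class = [
        average_precision([row[c] for row in scores], [row[c] for row in targets])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Hypothetical toy data: 4 clips, 3 classes.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.7, 0.6],
    [0.4, 0.5, 0.8],
    [0.6, 0.3, 0.1],
]
single_labels = [0, 1, 1, 0]  # one label per clip (single-label setting)
multi_targets = [  # 0/1 per class (multi-label setting)
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]

accuracy = top1_accuracy(toy_scores, single_labels)
map_value = mean_average_precision(toy_scores, multi_targets)
print(f"Top-1 accuracy: {accuracy:.1f}")
print(f"mAP: {map_value:.3f}")
```

Top-1 accuracy is expressed here as a percentage and mAP as a fraction between 0 and 1, matching the scales used in the table (65.8 and 0.504).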
