Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Classification on VGGSound

Metric: Top 1 Accuracy (higher is better)

Results

| # | Model | Top 1 Accuracy | Augmentations | Paper | Date | Code |
|---|-------|----------------|---------------|-------|------|------|
| 1 | Mirasol3B | 69.8 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 2 | CA2ST(B/16) | 68.3 | No | CA^2ST: Cross-Attention in Audio, Space, and Tim... | 2025-03-30 | - |
| 3 | ONE-PEACE (Audio-Visual) | 68.2 | Yes | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 4 | CAVA(B/16) | 68.2 | No | CA^2ST: Cross-Attention in Audio, Space, and Tim... | 2025-03-30 | - |
| 5 | MAViL | 67.1 | Yes | - | - | - |
| 6 | EquiAV | 67.1 | Yes | EquiAV: Leveraging Equivariance for Audio-Visual... | 2024-03-14 | Code |
| 7 | MMT (Audio-Visual) | 66.2 | No | - | - | - |
| 8 | CAV-MAE (Audio-Visual) | 65.9 | Yes | Contrastive Audio-Visual Masked Autoencoder | 2022-10-02 | Code |
| 9 | UAVM (Audio + Video) | 65.8 | Yes | UAVM: Towards Unifying Audio and Visual Models | 2022-07-29 | Code |
| 10 | Audiovisual Masked Autoencoder (Audiovisual, Single) | 65 | No | Audiovisual Masked Autoencoders | 2022-12-09 | Code |
| 11 | AVT (Audio-Visual) | 63.9 | No | - | - | - |
| 12 | ONE-PEACE (Audio-Only) | 59.6 | Yes | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 13 | CAV-MAE (Audio-Only) | 59.5 | Yes | Contrastive Audio-Visual Masked Autoencoder | 2022-10-02 | Code |
| 14 | Audiovisual Masked Autoencoder (Audio-only, Single) | 57.2 | No | Audiovisual Masked Autoencoders | 2022-12-09 | Code |
| 15 | MAST (Audio Only) | 57 | No | Multiscale Audio Spectrogram Transformer for Eff... | 2023-03-19 | - |
| 16 | UAVM (Audio Only) | 56.5 | Yes | UAVM: Towards Unifying Audio and Visual Models | 2022-07-29 | Code |
| 17 | MMT (Video) | 56.1 | No | - | - | - |
| 18 | PlayItBackX3 | 53.7 | No | Play It Back: Iterative Attention for Audio Reco... | 2022-10-20 | Code |
| 19 | AVT (V) | 53.2 | No | - | - | - |
| 20 | MBT (A) | 52.3 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 21 | MBT (V) | 51.2 | No | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 22 | UAVM (Video Only) | 49.9 | Yes | UAVM: Towards Unifying Audio and Visual Models | 2022-07-29 | Code |