| 1 | OmniVec2 | 0.558 | Yes | - | - | - |
| 2 | OmniVec | 0.548 | Yes | OmniVec: Learning robust representations with cr... | 2023-11-07 | - |
| 3 | EquiAV | 0.546 | No | EquiAV: Leveraging Equivariance for Audio-Visual... | 2024-03-14 | Code |
| 4 | MAViL (Audio-Visual, single) | 0.533 | Yes | - | - | - |
| 5 | Audiovisual Masked Autoencoder (Audiovisual, Single) | 0.518 | No | Audiovisual Masked Autoencoders | 2022-12-09 | Code |
| 6 | CAV-MAE (Audio-Visual) | 0.512 | Yes | Contrastive Audio-Visual Masked Autoencoder | 2022-10-02 | Code |
| 7 | BEATs (Audio-only, Ensemble) | 0.506 | No | BEATs: Audio Pre-Training with Acoustic Tokenizers | 2022-12-18 | Code |
| 8 | UAVM (Audio + Video) | 0.504 | Yes | UAVM: Towards Unifying Audio and Visual Models | 2022-07-29 | Code |
| 9 | SSLAM (Audio-Only, Single) | 0.502 | No | SSLAM: Enhancing Self-Supervised Models with Aud... | 2025-06-13 | Code |
| 10 | mn40_as (Ensemble) | 0.498 | Yes | Efficient Large-scale Audio Tagging via Transfor... | 2022-11-09 | Code |
| 11 | ATST-C2F (Single) | 0.497 | No | Self-supervised Audio Teacher-Student Transforme... | 2023-06-07 | Code |
| 12 | MBT (AS-500K training + Video) | 0.496 | Yes | Attention Bottlenecks for Multimodal Fusion | 2021-06-30 | Code |
| 13 | PaSST (Ensemble) | 0.496 | Yes | Efficient Training of Audio Transformers with Pa... | 2021-10-11 | Code |
| 14 | DyMN-L (Audio-Only, Single) | 0.490 | Yes | Dynamic Convolutional Neural Networks as Efficie... | 2023-10-24 | Code |
| 15 | M2D2 | 0.490 | No | M2D2: Exploring General-purpose Audio-Language R... | 2025-03-28 | Code |
| 16 | HTS-AT (Ensemble) | 0.487 | Yes | HTS-AT: A Hierarchical Token-Semantic Audio Tran... | 2022-02-02 | Code |
| 17 | BEATs (Audio-only, Single) | 0.486 | No | BEATs: Audio Pre-Training with Acoustic Tokenizers | 2022-12-18 | Code |
| 18 | EAT | 0.486 | No | EAT: Self-Supervised Pre-Training with Efficient... | 2024-01-07 | Code |
| 19 | DTF-AT (Single) | 0.486 | No | - | - | Code |
| 20 | AST (Ensemble) | 0.485 | Yes | AST: Audio Spectrogram Transformer | 2021-04-05 | Code |
| 21 | M2D-CLAP/0.7 | 0.485 | No | M2D-CLAP: Masked Modeling Duo Meets CLAP for Lea... | 2024-06-04 | Code |
| 22 | M2D-AS/0.7 | 0.485 | No | Masked Modeling Duo: Towards a Universal Audio P... | 2024-04-09 | Code |
| 23 | MAViL (Audio-only, single) | 0.484 | Yes | - | - | - |
| 24 | mn40_as (Single) | 0.483 | Yes | Efficient Large-scale Audio Tagging via Transfor... | 2022-11-09 | Code |
| 25 | MAX-AST (Single) | 0.481 | No | - | - | Code |
| 26 | ATST-Frame | 0.480 | No | Self-supervised Audio Teacher-Student Transforme... | 2023-06-07 | Code |
| 27 | M2D/0.7 | 0.479 | No | Masked Modeling Duo: Towards a Universal Audio P... | 2024-04-09 | Code |
| 28 | PlayItBackX3 | 0.477 | No | Play It Back: Iterative Attention for Audio Reco... | 2022-10-20 | Code |
| 29 | DASS-Medium (Audio-only, single) | 0.476 | No | DASS: Distilled Audio State Space Models Are Str... | 2024-07-04 | Code |
| 30 | PSLA (Ensemble) | 0.474 | Yes | PSLA: Improving Audio Tagging with Pretraining, ... | 2021-02-02 | Code |
| 31 | DASS-Small (Audio-only, single) | 0.472 | No | DASS: Distilled Audio State Space Models Are Str... | 2024-07-04 | Code |
| 32 | PaSST-S (Single) | 0.471 | Yes | Efficient Training of Audio Transformers with Pa... | 2021-10-11 | Code |
| 33 | MaskSpec (AS-2M) | 0.471 | No | - | - | - |
| 34 | CAV-MAE (Audio-Only) | 0.466 | Yes | Contrastive Audio-Visual Masked Autoencoder | 2022-10-02 | Code |
| 35 | Audiovisual Masked Autoencoder (Audio-only, Single) | 0.466 | No | Audiovisual Masked Autoencoders | 2022-12-09 | Code |
| 36 | AudioVisual Fusion Net | 0.462 | No | Large Scale Audiovisual Learning of Sounds with ... | 2020-05-29 | - |
| 37 | AST (Single) | 0.459 | Yes | AST: Audio Spectrogram Transformer | 2021-04-05 | Code |
| 38 | ERANN-1-6 | 0.450 | No | - | - | - |
| 39 | Perceiver | 0.449 | No | Perceiver: General Perception with Iterative Att... | 2021-03-04 | Code |
| 40 | PSLA (Single) | 0.443 | Yes | PSLA: Improving Audio Tagging with Pretraining, ... | 2021-02-02 | Code |
| 41 | PANNs-CNN14 (Single) | 0.431 | No | - | - | Code |
| 42 | EAT-M | 0.426 | No | End-to-End Audio Strikes Back: Boosting Augmenta... | 2022-04-25 | Code |
| 43 | Conformer (AS-2M) | 0.411 | No | Conformer-Based Self-Supervised Learning for Non... | 2021-10-14 | - |
| 44 | EAT-S | 0.405 | No | End-to-End Audio Strikes Back: Boosting Augmenta... | 2022-04-25 | Code |
| 45 | WEANet-SUSTAIN | 0.398 | No | A Sequential Self Teaching Approach for Improvin... | 2020-06-30 | - |
| 46 | VATT-Base | 0.394 | Yes | VATT: Transformers for Multimodal Self-Supervise... | 2021-04-22 | Code |
| 47 | Multi-Format Contrastive | 0.376 | No | Multi-Format Contrastive Learning of Audio Repre... | 2021-03-11 | - |
| 48 | MMV | 0.309 | No | Self-Supervised MultiModal Versatile Networks | 2020-06-29 | Code |
| 49 | CAV-MAE (Visual-Only) | 0.262 | Yes | Contrastive Audio-Visual Masked Autoencoder | 2022-10-02 | Code |
| 50 | L3 | 0.249 | No | Look, Listen and Learn | 2017-05-23 | Code |
| 51 | Triplet | 0.244 | No | Unsupervised Learning of Semantic Audio Represen... | 2017-11-06 | - |