Metric: R@1 (recall at rank 1; higher is better)
| # | Model | R@1 | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PaSST-RoBERTa & Estimated Audio–Caption Correspondences | 27.69 | Yes | Estimated Audio-Caption Correspondences Improve ... | 2024-08-21 | Code |
| 2 | InternVideo2-6B | 27.2 | Yes | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 3 | VAST | 26.9 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 4 | PaSST-RoBERTa & GPT-augment | 26.07 | Yes | Advancing Natural-Language Based Audio Retrieval... | 2023-08-08 | Code |
| 5 | ONE-PEACE | 22.4 | Yes | ONE-PEACE: Exploring One General Representation ... | 2023-05-18 | Code |
| 6 | VALOR | 17.5 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
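
For readers unfamiliar with the metric, the following is a minimal sketch of how R@1 is commonly computed for retrieval. It is not taken from any of the listed papers. It assumes a square score matrix in which `similarity[i][j]` scores query `i` against candidate `j`, and that the correct candidate for query `i` is candidate `i`. The function name `recall_at_k` and the toy scores are hypothetical, and the final multiplication by 100 assumes the table reports percentages.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item is among the top-k scores.

    Assumption: similarity[i][j] scores query i against candidate j, and the
    correct candidate for query i is candidate i.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one query row")
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example with made-up scores (not leaderboard data).
toy = [
    [0.9, 0.1, 0.3],  # query 0: best match is candidate 0 (correct)
    [0.2, 0.4, 0.8],  # query 1: best match is candidate 2 (incorrect)
    [0.1, 0.2, 0.7],  # query 2: best match is candidate 2 (correct)
]
r1_percent = 100 * recall_at_k(toy, k=1)
```

Real benchmarks may pair several captions with each audio clip, in which case the ground-truth mapping is no longer the diagonal and the hit test has to be adapted.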