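Metric: ROUGE-L (higher is better). Rows 1-4 appear to be reported as percentages and rows 5-9 as fractions in [0, 1], so the two groups are not on the same scale.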
| # | Model | ROUGE-L | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Audio Flamingo | 40.8 | Yes | Audio Flamingo: A Novel Audio Language Model wit... | 2024-02-02 | Code |
| 2 | ZerAuCap | 33.1 | Yes | Zero-shot audio captioning with audio-language m... | 2023-11-14 | Code |
| 3 | No audio (baseline) | 17.8 | No | Zero-shot audio captioning with audio-language m... | 2023-11-14 | Code |
| 4 | Shaharabany et al. | 8.2 | Yes | Zero-Shot Audio Captioning via Audibility Guidance | 2023-09-07 | - |
| 5 | AutoCap | 0.518 | No | Taming Data and Transformers for Audio Generation | 2024-06-27 | Code |
| 6 | LAVCap | 0.51 | No | LAVCap: LLM-based Audio-Visual Captioning using ... | 2025-01-16 | Code |
| 7 | VAST | 0.509 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 8 | Rethink-ACT (AST + TF + MIL) | 0.504 | No | - | - | - |
| 9 | VALOR | 0.494 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
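
For readers unfamiliar with the metric, here is a minimal sketch of how a ROUGE-L F-score can be computed from the longest common subsequence (LCS) of a candidate and a reference caption. It assumes whitespace tokenization, a single reference, and `beta=1.0`; evaluation toolkits may tokenize differently, use another `beta`, and take the best score over several references, so this is not necessarily how any entry in the table was scored. The helper `to_percent` is hypothetical and rests on the assumption that any value at or below 1.0 is a fraction, which would misread a genuine percentage below 1.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score in [0, 1] for one candidate and one reference."""
    cand = candidate.lower().split()  # assumption: whitespace tokenization
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


def to_percent(score):
    """Hypothetical helper: treat values <= 1.0 as fractions, scale to percent."""
    return score * 100 if score <= 1.0 else score


if __name__ == "__main__":
    print(rouge_l("a dog barks loudly", "a dog is barking loudly"))
    # Two values taken from the table, put on a common percent scale.
    print(to_percent(40.8), to_percent(0.518))
```

Applying something like `to_percent` to the ROUGE-L column before sorting is one way to compare the two groups of rows on a single scale.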