Audio captioning on AudioCaps
Metric: SPICE (higher is better)
Results
| # | Model | SPICE | Extra Data | Paper | Date | Code |
|---|-------|-------|------------|-------|------|------|
| 1 | Audio Flamingo | 15.1 | Yes | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | 2024-02-02 | Code |
| 2 | ZerAuCap | 8.6 | Yes | Zero-shot audio captioning with audio-language model guidance and audio context keywords | 2023-11-14 | Code |
| 3 | EnCLAP++-large | 0.197 | Yes | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | 2024-09-02 | Code |
| 4 | MQ-Cap | 0.194 | Yes | Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | 2024-10-14 | - |
| 5 | SLAM-AAC | 0.194 | Yes | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | 2024-10-12 | Code |
| 6 | LOAE | 0.193 | Yes | Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | 2024-06-19 | Code |
| 7 | EnCLAP++-base | 0.188 | Yes | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | 2024-09-02 | Code |
| 8 | EnCLAP-large | 0.1879 | No | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | 2024-01-31 | Code |
| 9 | EnCLAP-base | 0.1863 | No | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | 2024-01-31 | Code |
| 10 | LAVCap | 0.185 | No | LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | 2025-01-16 | Code |
| 11 | CNext-trans | 0.1841 | No | - | - | - |
| 12 | AutoCap | 0.182 | No | Taming Data and Transformers for Audio Generation | 2024-06-27 | Code |
| 13 | AL-MixGen + Multi-TTA | 0.181 | No | - | - | - |
| 14 | Rethink-ACT (AST + TF + MIL) | 0.18 | No | - | - | - |
| 15 | AL-MixGen | 0.177 | No | Exploring Train and Test-Time Augmentations for Audio-Language Learning | 2022-10-31 | - |
| 16 | BART + YAMNet + PANNs | 0.176 | No | - | - | Code |
| 17 | CNN+Transformer | 0.159 | No | Audio Captioning Transformer | 2021-07-21 | Code |
| 18 | TopDown-AlignedAtt (1NN) | 0.144 | No | - | - | - |
| 19 | No audio (baseline) | 0 | No | - | - | Code |

Entries flagged SOTA on the leaderboard: Audio Flamingo, ZerAuCap, AL-MixGen, CNN+Transformer.
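The SPICE values listed here appear to mix two scales: the top two entries (15.1 and 8.6) look like percentages, while every other entry falls between 0 and 1. The sketch below shows one way to put the scores on a common 0 to 1 scale before comparing them. It assumes that any value above 1 was reported as a percentage, which the leaderboard does not confirm. The function name `normalize_spice` and the threshold of 1.0 are illustrative choices, not part of the leaderboard.

```python
def normalize_spice(score: float) -> float:
    """Map a SPICE score to the 0-1 range.

    Assumption (not confirmed by the leaderboard): values above 1 were
    reported as percentages and should be divided by 100.
    """
    return score / 100.0 if score > 1.0 else score


# A few (model, SPICE) pairs copied from the results listed above.
rows = [
    ("Audio Flamingo", 15.1),
    ("ZerAuCap", 8.6),
    ("EnCLAP++-large", 0.197),
    ("MQ-Cap", 0.194),
    ("CNN+Transformer", 0.159),
]

# Re-rank after normalization, best score first.
normalized = sorted(
    ((name, normalize_spice(score)) for name, score in rows),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in normalized:
    print(f"{name:<16} {score:.3f}")
```

If the percentage assumption holds, the ordering of these entries differs from the rank shown on the leaderboard, so it is worth checking the scale against the original papers before drawing conclusions.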