Audio captioning on AudioCaps
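Metric: CIDEr (higher is better)
Note: the three top-ranked entries (50.2, 28.1, 9.2) appear to be reported on a 0-100 scale, while every other entry uses a 0-1 scale, so ranks 1-3 are not directly comparable with the rest of the list.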
Results
| # | Model | CIDEr | Extra Data | SOTA tag | Paper | Date | Code |
|---|-------|-------|------------|----------|-------|------|------|
| 1 | Audio Flamingo | 50.2 | Yes | SOTA | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | 2024-02-02 | Yes |
| 2 | ZerAuCap | 28.1 | Yes | SOTA | Zero-shot audio captioning with audio-language model guidance and audio context keywords | 2023-11-14 | Yes |
| 3 | Shaharabany et al. | 9.2 | Yes | SOTA | Zero-Shot Audio Captioning via Audibility Guidance | 2023-09-07 | - |
| 4 | LAVCap | 0.849 | No | - | LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | 2025-01-16 | Yes |
| 5 | MQ-Cap | 0.845 | Yes | - | Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning | 2024-10-14 | - |
| 6 | SLAM-AAC | 0.841 | Yes | - | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | 2024-10-12 | Yes |
| 7 | AutoCap | 0.832 | No | - | Taming Data and Transformers for Audio Generation | 2024-06-27 | Yes |
| 8 | EnCLAP++-large | 0.823 | Yes | - | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | 2024-09-02 | Yes |
| 9 | LOAE | 0.816 | Yes | - | Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding | 2024-06-19 | Yes |
| 10 | EnCLAP++-base | 0.815 | Yes | - | EnCLAP++: Analyzing the EnCLAP Framework for Optimizing Automated Audio Captioning Performance | 2024-09-02 | Yes |
| 11 | CNext-trans | 0.8061 | No | - | - | - | - |
| 12 | EnCLAP-large | 0.8029 | No | - | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | 2024-01-31 | Yes |
| 13 | VAST | 0.781 | Yes | SOTA | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Yes |
| 14 | EnCLAP-base | 0.7795 | No | - | EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning | 2024-01-31 | Yes |
| 15 | AL-MixGen + Multi-TTA | 0.769 | No | - | - | - | - |
| 16 | Rethink-ACT (AST + TF + MIL) | 0.764 | No | - | - | - | - |
| 17 | AL-MixGen | 0.755 | No | SOTA | Exploring Train and Test-Time Augmentations for Audio-Language Learning | 2022-10-31 | - |
| 18 | BART + YAMNet + PANNs | 0.753 | No | - | - | - | Yes |
| 19 | VALOR | 0.741 | Yes | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Yes |
| 20 | CNN+Transformer | 0.693 | No | SOTA | Audio Captioning Transformer | 2021-07-21 | Yes |
| 21 | TopDown-AlignedAtt (1NN) | 0.593 | No | - | - | - | - |
| 22 | Audio Flamingo (4-shot) | 0.518 | Yes | - | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | 2024-02-02 | Yes |
| 23 | RECAP (4-shot) | 0.359 | No | - | RECAP: Retrieval-Augmented Audio Captioning | 2023-09-18 | Yes |
| 24 | Prefix tuning for automated audio captioning | 0.211 | No | - | Prefix tuning for automated audio captioning | 2023-03-30 | Yes |
| 25 | Audio captioning transformer | 0.149 | No | - | Audio Captioning Transformer | 2021-07-21 | Yes |
| 26 | Automated audio captioning by fine-tuning bart with audioset tags | 0.147 | No | - | - | - | Yes |
| 27 | No audio (baseline) | 0.1 | No | - | Zero-shot audio captioning with audio-language model guidance and audio context keywords | 2023-11-14 | Yes |
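The results above mix values such as 50.2 with values such as 0.849, which suggests two reporting scales. The following is a minimal sketch of how a subset of the listed rows could be re-ranked after putting them on one scale. It rests on an assumption not confirmed by the page: that the three entries named in `ASSUMED_PERCENT_SCALE` are reported on a 0-100 scale and everything else on a 0-1 scale. The names `ASSUMED_PERCENT_SCALE`, `to_unit_scale`, and `rerank` are hypothetical helpers introduced only for illustration.

```python
# A few (model, CIDEr as listed) rows copied from the results above.
listed = [
    ("Audio Flamingo", 50.2),
    ("ZerAuCap", 28.1),
    ("Shaharabany et al.", 9.2),
    ("LAVCap", 0.849),
    ("MQ-Cap", 0.845),
    ("Audio Flamingo (4-shot)", 0.518),
    ("No audio (baseline)", 0.1),
]

# ASSUMPTION (not stated on the page): these entries use a 0-100 scale.
# They are named explicitly because CIDEr is not bounded by 1 in general,
# so a "value > 1" threshold would not be a safe test.
ASSUMED_PERCENT_SCALE = {"Audio Flamingo", "ZerAuCap", "Shaharabany et al."}


def to_unit_scale(name: str, score: float) -> float:
    """Divide by 100 only for entries assumed to be on a 0-100 scale."""
    return score / 100.0 if name in ASSUMED_PERCENT_SCALE else score


def rerank(rows):
    """Return (model, normalized score) pairs sorted best first."""
    normalized = [(name, round(to_unit_scale(name, s), 4)) for name, s in rows]
    return sorted(normalized, key=lambda row: row[1], reverse=True)


reranked = rerank(listed)
for rank, (name, score) in enumerate(reranked, start=1):
    print(f"{rank:>2}  {name:<26} {score:.3f}")
```

Under that assumption, the entries listed at ranks 1-3 would fall below the fully supervised systems, and the Audio Flamingo score would sit next to its 4-shot variant. If the assumption is wrong, the listed order stands.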