Audio captioning

20 benchmarks, 119 papers

Audio captioning is the task of describing an audio clip in natural-language text. The general approach is to encode the audio with an audio encoder (for example, PANN or CAV-MAE) and to generate the text with a decoder (for example, a transformer). Caption quality is commonly judged with machine translation metrics (BLEU, METEOR, ROUGE) and image captioning metrics (SPICE, CIDEr), although these are not well suited to the task. Attempts have been made to use metrics based on pretrained language models, such as Sentence-BERT.
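One likely reason the translation-style metrics fit poorly is that they reward word overlap rather than meaning. The sketch below illustrates this with a clipped n-gram precision, the building block of BLEU. It is a simplified illustration, not the full BLEU computation (no brevity penalty, no averaging over n-gram orders), and the function names and example captions are made up for this purpose.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical captions, invented for illustration only.
reference = "a dog barks while cars pass by"
paraphrase = "a canine is barking near traffic"  # same meaning, different words
shuffled = "cars pass by while a dog barks"      # same words, different order

for name, caption in [("paraphrase", paraphrase), ("shuffled", shuffled)]:
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{name}: unigram={p1:.3f} bigram={p2:.3f}")
```

Here the paraphrase shares only the word "a" with the reference, so its unigram precision is 1/6 and its bigram precision is 0, even though it describes the same sound scene. The reordered caption reaches a unigram precision of 1.0. Embedding-based metrics such as Sentence-BERT similarity aim to close this gap by comparing sentence meaning instead of surface tokens.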

Benchmarks

Audio captioning on AudioCaps

CIDEr SPIDEr SPICE METEOR BLEU-4 ROUGE-L FENSE SPIDEr-FL #params (M) ROUGE Sentence-BERT

Audio captioning on Clotho

CIDEr SPIDEr SPICE METEOR BLEU-4 ROUGE-L FENSE SPIDEr-FL Sentence-BERT