Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

2024-06-19 | Audio captioning

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
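The abstract states that both the encoder and the decoder are tuned with LoRA. As a rough illustration of the general LoRA idea only (not the authors' code), the sketch below adds a scaled low-rank product B·A to a frozen weight matrix W. The matrix sizes, the rank, the `alpha` value, and the helper names `matmul` and `lora_weight` are all assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (number of rows of A)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical toy sizes; real layers are far larger.
d_out, d_in, rank = 4, 6, 2
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen weight
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, small random init
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero init

adapted = lora_weight(w, a, b, alpha=4.0)

# Only A and B are trained, so the trainable parameter count is r * (d_in + d_out).
trainable_params = rank * (d_in + d_out)
full_params = d_in * d_out
print(f"trainable: {trainable_params}, full matrix: {full_params}")
```

Because B starts at zero, the adapted weight equals W before any training step, so fine-tuning begins from the pre-trained model's behaviour. The parameter saving is negligible at this toy size and only becomes substantial when `d_in` and `d_out` are much larger than the rank.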

Results

Task	Dataset	Metric	Value	Model
Audio captioning	Clotho	CIDEr	0.513	LOAE
Audio captioning	Clotho	FENSE	0.538	LOAE
Audio captioning	Clotho	METEOR	0.197	LOAE
Audio captioning	Clotho	SPICE	0.147	LOAE
Audio captioning	Clotho	SPIDEr	0.33	LOAE
Audio captioning	Clotho	SPIDEr-FL	0.33	LOAE
Audio captioning	Clotho	Sentence-BERT	0.538	LOAE
Audio captioning	AudioCaps	CIDEr	0.816	LOAE
Audio captioning	AudioCaps	FENSE	0.664	LOAE
Audio captioning	AudioCaps	METEOR	0.267	LOAE
Audio captioning	AudioCaps	SPICE	0.193	LOAE
Audio captioning	AudioCaps	SPIDEr	0.505	LOAE
Audio captioning	AudioCaps	Sentence-BERT	0.664	LOAE
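SPIDEr is defined as the arithmetic mean of CIDEr and SPICE, so the SPIDEr rows in the table can be cross-checked against the CIDEr and SPICE rows. The short sketch below does that with the values reported above; the function name `spider` and the dictionary layout are choices made for this example.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of the CIDEr and SPICE scores."""
    return (cider + spice) / 2


# Values copied from the results table above.
reported = {
    "Clotho": {"CIDEr": 0.513, "SPICE": 0.147, "SPIDEr": 0.33},
    "AudioCaps": {"CIDEr": 0.816, "SPICE": 0.193, "SPIDEr": 0.505},
}

for dataset, scores in reported.items():
    computed = spider(scores["CIDEr"], scores["SPICE"])
    # Compare at the table's three-decimal precision.
    matches = abs(computed - scores["SPIDEr"]) < 1e-3
    print(f"{dataset}: computed {computed:.4f}, reported {scores['SPIDEr']}, consistent: {matches}")
```

Both datasets agree with the reported SPIDEr values at the table's precision. SPIDEr-FL cannot be recomputed this way, since it additionally depends on per-caption fluency error detection.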
