Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

2024-06-19
Task: Audio captioning

Abstract

Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) that bridges the modality gap to the LLM and compresses the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a SPIDEr-FL score of 33.0, outperforming the winner of DCASE 2023 Task 6A.
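
The abstract states that both the encoder and the decoder are optimized with LoRA. As a rough illustration of the general technique, not the authors' implementation, the sketch below applies a low-rank update to a single frozen linear layer, computing W x + (alpha / r) * B (A x). The matrix values, the rank, the `alpha` scale, and the 4096-dimensional layer size are all hypothetical; the paper page does not give these settings.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    base = matvec(w, x)               # output of the frozen pre-trained weight
    update = matvec(b, matvec(a, x))  # low-rank correction
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    """Return (frozen weights, trainable LoRA weights) for one linear layer."""
    return d_out * d_in, r * (d_in + d_out)


# Toy 2x3 layer with a rank-1 adapter (all values are hypothetical).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.5]]        # shape (r=1, d_in=3)
B_zero = [[0.0], [0.0]]      # common LoRA init: B = 0, so the update starts at zero
B_trained = [[1.0], [2.0]]   # pretend values after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)

# Hypothetical 4096x4096 projection with rank 8.
frozen, trainable = lora_param_counts(4096, 4096, 8)
```

With B initialized to zero the adapted layer reproduces the frozen layer exactly, and only the two small matrices A and B are trained, which is why the trainable count is a small fraction of the frozen one.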

Results

Task              Dataset    Metric         Value  Model
Audio captioning  Clotho     CIDEr          0.513  LOAE
Audio captioning  Clotho     FENSE          0.538  LOAE
Audio captioning  Clotho     METEOR         0.197  LOAE
Audio captioning  Clotho     SPICE          0.147  LOAE
Audio captioning  Clotho     SPIDEr         0.33   LOAE
Audio captioning  Clotho     SPIDEr-FL      0.33   LOAE
Audio captioning  Clotho     Sentence-BERT  0.538  LOAE
Audio captioning  AudioCaps  CIDEr          0.816  LOAE
Audio captioning  AudioCaps  FENSE          0.664  LOAE
Audio captioning  AudioCaps  METEOR         0.267  LOAE
Audio captioning  AudioCaps  SPICE          0.193  LOAE
Audio captioning  AudioCaps  SPIDEr         0.505  LOAE
Audio captioning  AudioCaps  Sentence-BERT  0.664  LOAE
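
SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, so the SPIDEr rows can be cross-checked against the CIDEr and SPICE rows. The short sketch below does that check with the values listed in the table; the function and variable names are illustrative only.

```python
def spider(cider, spice):
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


# CIDEr and SPICE values copied from the results table.
clotho_spider = spider(cider=0.513, spice=0.147)
audiocaps_spider = spider(cider=0.816, spice=0.193)

print(f"Clotho SPIDEr:    {clotho_spider:.4f}")
print(f"AudioCaps SPIDEr: {audiocaps_spider:.4f}")
```

Both computed values agree with the reported SPIDEr scores (0.33 on Clotho, 0.505 on AudioCaps) to within rounding.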

Related Papers

video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
AC/DC: LLM-based Audio Comprehension via Dialogue Continuation (2025-06-12)
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion (2025-06-01)
CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer (2025-06-01)
Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning (2025-05-28)
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining (2025-05-12)
M2D2: Exploring General-purpose Audio-Language Representations Beyond CLAP (2025-03-28)
Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context (2025-03-19)