Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

2024-10-12 · Speech Recognition · Machine Translation · Caption Generation · Text Similarity · Audio Captioning · Translation

Paper · PDF · Code (official)

Abstract

Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual reasoning capabilities, making improvements in AAC possible. In this paper, we propose SLAM-AAC to further enhance AAC with paraphrasing augmentation and CLAP-Refine through LLMs. Our approach uses the self-supervised EAT model to extract fine-grained audio representations, which are then aligned with textual embeddings via lightweight linear layers. The caption-generation LLM is efficiently fine-tuned using a LoRA adapter. Drawing inspiration from the back-translation method in machine translation, we implement paraphrasing augmentation to expand the Clotho dataset during pre-training. This strategy helps alleviate the scarcity of audio-text pairs and generates more diverse captions from a small set of audio clips. During inference, we introduce the plug-and-play CLAP-Refine strategy to fully exploit multiple decoding outputs, akin to the n-best rescoring strategy in speech recognition. Using the CLAP model to compute audio-text similarity, we can select, from the captions generated by multiple search beams, the one that best matches the input audio. Experimental results show that SLAM-AAC achieves state-of-the-art performance on Clotho V2 and AudioCaps, surpassing previous mainstream models.
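The CLAP-Refine step described in the abstract amounts to n-best reranking: the model decodes with several beams, and a pretrained audio-text model scores each candidate caption against the input audio so the best-matching one can be selected. A minimal sketch of that selection logic, assuming a generic `score_fn` in place of a real CLAP model (the scorer below is a hypothetical stand-in, not the authors' implementation):

```python
def clap_refine(audio, candidates, score_fn):
    """Select the candidate caption with the highest audio-text similarity.

    audio      -- the input audio (opaque here; passed through to score_fn)
    candidates -- captions produced by multiple beam-search decodings
    score_fn   -- callable (audio, caption) -> similarity score; in SLAM-AAC
                  this role is played by a pretrained CLAP model
    """
    return max(candidates, key=lambda caption: score_fn(audio, caption))


if __name__ == "__main__":
    # Toy stand-in scorer: counts overlap with a reference keyword set.
    # A real CLAP scorer would embed audio and text and take cosine similarity.
    keywords = {"dog", "barking", "park"}

    def toy_score(audio, caption):
        return len(keywords & set(caption.lower().split()))

    beams = [
        "a dog is barking",
        "a dog barking in the park",
        "music playing loudly",
    ]
    print(clap_refine(audio=None, candidates=beams, score_fn=toy_score))
    # prints "a dog barking in the park"
```

Because the reranker only needs a similarity callable, it is plug-and-play in the sense the paper describes: any audio-text scoring model can be dropped in without retraining the captioning model.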

Results

Task             | Dataset   | Metric   | Value | Model
-----------------|-----------|----------|-------|---------
Audio captioning | Clotho    | CIDEr    | 0.515 | SLAM-AAC
Audio captioning | Clotho    | FENSE    | 0.54  | SLAM-AAC
Audio captioning | Clotho    | METEOR   | 0.197 | SLAM-AAC
Audio captioning | Clotho    | SPICE    | 0.148 | SLAM-AAC
Audio captioning | Clotho    | SPIDEr   | 0.332 | SLAM-AAC
Audio captioning | Clotho    | SPIDEr-FL| 0.33  | SLAM-AAC
Audio captioning | AudioCaps | CIDEr    | 0.841 | SLAM-AAC
Audio captioning | AudioCaps | FENSE    | 0.668 | SLAM-AAC
Audio captioning | AudioCaps | METEOR   | 0.268 | SLAM-AAC
Audio captioning | AudioCaps | SPICE    | 0.194 | SLAM-AAC
Audio captioning | AudioCaps | SPIDEr   | 0.518 | SLAM-AAC
Audio captioning | AudioCaps | SPIDEr-FL| 0.515 | SLAM-AAC

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
- Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
- Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings (2025-07-09)
- GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning (2025-07-09)