Choi Changin, Lim Sungjun, Rhee Wonjong
Retrieval-augmented generation can improve audio captioning by incorporating relevant audio-text pairs from a knowledge base. Existing methods typically rely on the input audio alone as a unimodal retrieval query. In contrast, we propose Generation-Assisted Multimodal Querying, which generates a text description of the input audio to enable multimodal querying. This aligns the query modality with the audio-text structure of the knowledge base, leading to more effective retrieval. Furthermore, we introduce a progressive learning strategy that gradually increases the number of interleaved audio-text pairs during training. Experiments on AudioCaps, Clotho, and Auto-ACD show that our approach achieves state-of-the-art results across all three benchmarks.
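A minimal sketch of the multimodal-querying idea described above: the input audio's embedding is fused with the embedding of a generated text description, and the fused query is scored against a knowledge base of unit-normalized entries by cosine similarity. The function names, the weighted-sum fusion, and the `alpha` parameter are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def multimodal_query(audio_emb, text_emb, alpha=0.5):
    """Hypothetical fusion: weighted sum of the audio embedding and the
    embedding of the generated caption, re-normalized to unit length."""
    q = alpha * np.asarray(audio_emb, dtype=float) \
        + (1 - alpha) * np.asarray(text_emb, dtype=float)
    return normalize(q)

def retrieve(query, kb_embs, k=2):
    """Return indices of the top-k knowledge-base entries by cosine
    similarity, assuming each row of kb_embs is unit-normalized."""
    sims = kb_embs @ query
    return np.argsort(-sims)[:k].tolist()

# Toy example with 3-dim embeddings standing in for real audio/text encoders.
kb = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
query = multimodal_query([0.9, 0.1, 0.0], [0.8, 0.2, 0.0])
print(retrieve(query, kb, k=2))  # → [0, 1]
```

In practice the retrieved audio-text pairs would then be interleaved into the captioning model's context; the sketch only covers the retrieval step.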
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio captioning | Clotho | BLEU-4 | 0.181 | MQ-Cap |
| Audio captioning | Clotho | CIDEr | 0.496 | MQ-Cap |
| Audio captioning | Clotho | METEOR | 0.192 | MQ-Cap |
| Audio captioning | Clotho | SPICE | 0.143 | MQ-Cap |
| Audio captioning | Clotho | SPIDEr | 0.319 | MQ-Cap |
| Audio captioning | AudioCaps | BLEU-4 | 0.301 | MQ-Cap |
| Audio captioning | AudioCaps | CIDEr | 0.845 | MQ-Cap |
| Audio captioning | AudioCaps | METEOR | 0.266 | MQ-Cap |
| Audio captioning | AudioCaps | SPICE | 0.194 | MQ-Cap |
| Audio captioning | AudioCaps | SPIDEr | 0.519 | MQ-Cap |