Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Zero-shot audio captioning with audio-language model guidance and audio context keywords

Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata

2023-11-14 · Speech Recognition · Zero-shot Audio Captioning · Audio Captioning · Image Captioning · Large Language Model · Language Modelling
Paper · PDF · Code (official)

Abstract

Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captioning methods, we propose ZerAuCap, a novel framework for summarising such general audio signals in a text caption without requiring task-specific training. In particular, our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions that describe the audio content. Additionally, we use audio context keywords that prompt the language model to generate text that is broadly relevant to sounds. Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets. Our code is available at https://github.com/ExplainableML/ZerAuCap.
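The guidance idea from the abstract can be sketched as a re-scored greedy decoding loop: at each step, the language model's candidate tokens are re-ranked by how similar the resulting caption is to the audio representation from the audio-language model. The snippet below is a minimal illustration only; the real ZerAuCap framework uses a pre-trained LLM and a pre-trained audio-language model, which are stubbed here with toy bag-of-words embeddings, and all function names (`lm_next_token_scores`, `embed_audio_keywords`, `guided_caption`, `alpha`) are hypothetical.

```python
import math

# Toy stand-ins for the pre-trained models in ZerAuCap. The real framework
# uses an LLM for text generation and an audio-language model for guidance;
# these stubs only illustrate the re-ranking logic.
VOCAB = ["a", "dog", "barks", "car", "engine", "revs", "loudly"]

def lm_next_token_scores(prefix):
    # Hypothetical LM: uniform scores, mildly penalising repeated tokens.
    return {tok: (0.5 if tok in prefix else 1.0) for tok in VOCAB}

def embed_text(words):
    # Hypothetical text embedding: bag-of-words indicator vector over VOCAB.
    return [1.0 if tok in words else 0.0 for tok in VOCAB]

def embed_audio_keywords(keywords):
    # Audio context keywords stand in for the audio embedding in this sketch.
    return embed_text(keywords)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def guided_caption(audio_keywords, steps=3, alpha=1.0):
    """Greedy decoding where LM scores are mixed with audio-text similarity."""
    audio_emb = embed_audio_keywords(audio_keywords)
    caption = []
    for _ in range(steps):
        best_tok, best_score = None, -float("inf")
        for tok, lm_score in lm_next_token_scores(caption).items():
            sim = cosine(embed_text(caption + [tok]), audio_emb)
            score = lm_score + alpha * sim  # guidance term steers the LM
            if score > best_score:
                best_tok, best_score = tok, score
        caption.append(best_tok)
    return " ".join(caption)

print(guided_caption(["dog", "barks"]))
```

With `alpha = 0` this degenerates to unguided LM decoding; raising `alpha` pulls the caption toward tokens consistent with the audio, which is the trade-off the guidance weight controls.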

Results

| Task             | Dataset   | Metric  | Value | Model               |
|------------------|-----------|---------|-------|---------------------|
| Audio captioning | AudioCaps | BLEU-4  | 6.8   | ZerAuCap            |
| Audio captioning | AudioCaps | CIDEr   | 28.1  | ZerAuCap            |
| Audio captioning | AudioCaps | METEOR  | 12.3  | ZerAuCap            |
| Audio captioning | AudioCaps | ROUGE-L | 33.1  | ZerAuCap            |
| Audio captioning | AudioCaps | SPICE   | 8.6   | ZerAuCap            |
| Audio captioning | AudioCaps | SPIDEr  | 18.3  | ZerAuCap            |
| Audio captioning | AudioCaps | CIDEr   | 0.1   | No audio (baseline) |
| Audio captioning | AudioCaps | METEOR  | 4.1   | No audio (baseline) |
| Audio captioning | AudioCaps | ROUGE-L | 17.8  | No audio (baseline) |
| Audio captioning | Clotho    | BLEU-4  | 2.9   | ZerAuCap            |
| Audio captioning | Clotho    | CIDEr   | 14    | ZerAuCap            |
| Audio captioning | Clotho    | METEOR  | 9.4   | ZerAuCap            |
| Audio captioning | Clotho    | ROUGE-L | 25.4  | ZerAuCap            |
| Audio captioning | Clotho    | SPICE   | 5.3   | ZerAuCap            |
| Audio captioning | Clotho    | SPIDEr  | 9.7   | ZerAuCap            |
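SPIDEr is, by its standard definition, the arithmetic mean of SPICE and CIDEr, and the reported SPIDEr rows are consistent with the other two metrics on both datasets. A quick check in Python:

```python
# SPIDEr is defined as the mean of the SPICE and CIDEr scores.
def spider(spice, cider):
    return (spice + cider) / 2

# Values taken from the ZerAuCap rows of the results table above.
audiocaps = spider(8.6, 28.1)  # reported as 18.3
clotho = spider(5.3, 14.0)     # reported as 9.7
print(round(audiocaps, 2), round(clotho, 2))
```

Both computed values round to the reported SPIDEr scores (18.35 and 9.65, shown rounded to one decimal in the table).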

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization (2025-07-17)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)