Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

Published: 2024-02-02
Tasks: Few-Shot Learning, Zero-shot Audio Captioning, Audio captioning, Retrieval, Retrieval-augmented Few-shot In-context Audio Captioning, Language Modelling, Acoustic Scene Classification

Abstract

Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) strong multi-turn dialogue abilities. We introduce a series of training techniques, architecture design, and data strategies to enhance our model with these abilities. Extensive evaluations across various audio understanding tasks confirm the efficacy of our method, setting new state-of-the-art benchmarks. Our demo website is https://audioflamingo.github.io/ and the code is open-sourced at https://github.com/NVIDIA/audio-flamingo.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Acoustic Scene Classification | CochlScene | 1:1 Accuracy | 0.83 | Audio Flamingo |
| Audio captioning | Clotho | BLEU-4 | 17.4 | Audio Flamingo (Pengi trainset) |
| Audio captioning | Clotho | CIDEr | 0.489 | Audio Flamingo (Pengi trainset) |
| Audio captioning | Clotho | METEOR | 18.7 | Audio Flamingo (Pengi trainset) |
| Audio captioning | Clotho | ROUGE-L | 39.4 | Audio Flamingo (Pengi trainset) |
| Audio captioning | Clotho | SPICE | 0.134 | Audio Flamingo (Pengi trainset) |
| Audio captioning | Clotho | SPIDEr | 0.312 | Audio Flamingo (Pengi trainset) |
| Audio captioning | AudioCaps | CIDEr | 0.518 | Audio Flamingo (4-shot) |
| Audio captioning | AudioCaps | BLEU-4 | 14.3 | Audio Flamingo |
| Audio captioning | AudioCaps | CIDEr | 50.2 | Audio Flamingo |
| Audio captioning | AudioCaps | METEOR | 20.5 | Audio Flamingo |
| Audio captioning | AudioCaps | ROUGE-L | 40.8 | Audio Flamingo |
| Audio captioning | AudioCaps | SPICE | 15.1 | Audio Flamingo |
| Audio captioning | AudioCaps | SPIDEr | 32.6 | Audio Flamingo |
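
The SPIDEr rows can be read against the CIDEr and SPICE rows for the same dataset. SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE; this page does not say how its values were computed, so the sketch below is only a consistency check under that assumption. The function name `spider` is illustrative, and the inputs are the values reported above (Clotho on a 0-1 scale, the zero-shot AudioCaps rows apparently on a 0-100 scale).

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, assumed here to be the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# Values copied from the results listed above.
clotho = spider(cider=0.489, spice=0.134)    # reported SPIDEr: 0.312
audiocaps = spider(cider=50.2, spice=15.1)   # reported SPIDEr: 32.6

print(f"Clotho SPIDEr (recomputed):    {clotho:.4f}")
print(f"AudioCaps SPIDEr (recomputed): {audiocaps:.2f}")
```

Both recomputed means land within rounding distance of the listed SPIDEr values, which is what one would expect if the listed CIDEr and SPICE figures were themselves rounded before being reported.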

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)