Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Taming Data and Transformers for Audio Generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Vicente Ordonez

2024-06-27 · arXiv 2024 · Tasks: Audio Generation, Audio Captioning, Audio Synthesis
Paper · PDF · Code

Abstract

The scalability of ambient sound generators is hindered by data scarcity, low caption quality, and limited model capacity. This work addresses these challenges by advancing both data and model scaling. First, we propose an efficient and scalable dataset collection pipeline tailored for ambient audio generation, resulting in AutoReCap-XL, the largest ambient audio-text dataset, with over 47 million clips. To provide high-quality textual annotations, we propose AutoCap, an automatic audio captioning model. By adopting a Q-Former module and leveraging audio metadata, AutoCap substantially improves caption quality, reaching a CIDEr score of 83.2, a 3.2% improvement over previous captioning models. Finally, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters, and we demonstrate the benefits of both data scaling with synthetic captions and model-size scaling. Compared to baseline audio generators trained at a similar size and data scale, GenAu obtains significant improvements of 4.7% in FAD score, 11.1% in IS, and 13.5% in CLAP score. Our code, model checkpoints, and dataset are publicly available.
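The FAD score cited in the abstract is the Fréchet distance between multivariate Gaussians fitted to embeddings of reference and generated audio. As a minimal sketch of that computation (the audio embedding model itself is abstracted away; the function name and toy data below are illustrative, not the paper's code):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    FAD = ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    where each row of emb_a / emb_b is one clip's embedding vector.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    sigma_a = np.cov(emb_a, rowvar=False)
    sigma_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(sigma_a @ sigma_b)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))

# Toy embeddings standing in for real vs. generated audio clips.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
fake = rng.normal(0.5, 1.0, size=(2000, 8))
print(frechet_audio_distance(real, real))  # near zero for identical sets
print(frechet_audio_distance(real, fake))  # positive for mismatched sets
```

Lower is better: identical embedding distributions give a distance of zero, and any mean or covariance mismatch between generated and reference audio increases it.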

Results

Task | Dataset | Metric | Value | Model
Audio Generation | AudioCaps | CLAP_MS | 0.668 | GenAu-Large
Audio Generation | AudioCaps | FAD | 1.21 | GenAu-Large
Audio Generation | AudioCaps | FD | 16.51 | GenAu-Large
Audio captioning | AudioCaps | CIDEr | 0.832 | AutoCap
Audio captioning | AudioCaps | METEOR | 0.253 | AutoCap
Audio captioning | AudioCaps | ROUGE | 0.518 | AutoCap
Audio captioning | AudioCaps | ROUGE-L | 0.518 | AutoCap
Audio captioning | AudioCaps | SPICE | 0.182 | AutoCap
Audio captioning | AudioCaps | SPIDEr | 0.507 | AutoCap
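The percentage gains quoted in the abstract (e.g. 4.7% in FAD) are relative improvements over a baseline's metric value, with the direction depending on whether the metric is lower-is-better (FAD, FD) or higher-is-better (IS, CLAP, CIDEr). A small helper makes that convention explicit; the baseline number below is hypothetical, chosen only to illustrate the arithmetic against the GenAu-Large FAD of 1.21 in the table:

```python
def relative_improvement(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Percent improvement of `ours` over `baseline`.

    For lower-is-better metrics, improvement means a decrease;
    for higher-is-better metrics, an increase.
    """
    if lower_is_better:
        return 100.0 * (baseline - ours) / baseline
    return 100.0 * (ours - baseline) / baseline

# Hypothetical baseline FAD of 1.27 vs. the table's 1.21 (lower is better):
print(round(relative_improvement(1.27, 1.21, lower_is_better=True), 2))  # 4.72
```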

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling (2025-07-11)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation (2025-06-13)
ViSAGe: Video-to-Spatial Audio Generation (2025-06-13)