Stable Audio Open

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

2024-07-19 | Text-to-Music Generation, Audio Generation

Abstract

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.
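For readers unfamiliar with the FD_openl3 metric mentioned in the abstract, the following is a minimal sketch of a Fréchet distance between two sets of embeddings, each summarized as a Gaussian. It assumes the embeddings have already been extracted (for FD_openl3 they would come from an OpenL3 model, which is not shown here). The function name `frechet_distance` and the array shapes are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np


def frechet_distance(emb_ref, emb_gen):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (n_samples, dim). Hypothetical inputs;
    in practice these would be embeddings of reference and generated audio.
    """
    emb_ref = np.asarray(emb_ref, dtype=float)
    emb_gen = np.asarray(emb_gen, dtype=float)

    # Mean and covariance of each embedding set.
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(emb_ref, rowvar=False))
    cov_g = np.atleast_2d(np.cov(emb_gen, rowvar=False))

    # tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

A lower value means the generated embeddings are statistically closer to the reference embeddings. Two identical sets give a distance of approximately zero.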

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Generation | AudioCaps | CLAP_LAION | 0.35 | Stable Audio Open |
| Audio Generation | AudioCaps | CLAP_MS | 0.34 | Stable Audio Open |
| Audio Generation | AudioCaps | FD_openl3 | 78.24 | Stable Audio Open |
| Audio Generation | AudioCaps | KL_passt | 2.14 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | CLAP_LAION | 0.48 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | CLAP_MS | 0.49 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FAD | 3.51 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FD | 36.42 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FD_openl3 | 127.2 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | IS | 2.93 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | KL_passt | 1.32 | Stable Audio Open |
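To give a rough sense of what the CLAP and KL rows measure, here is a small sketch under stated assumptions: a CLAP-style score is taken to be the cosine similarity between a text embedding and an audio embedding, and a KL-style score is taken to be the Kullback-Leibler divergence between two class-probability vectors (for KL_passt these would come from a PaSST audio tagger). The embedding and tagging models are not shown, the helper names are hypothetical, and the exact evaluation protocol behind the reported numbers may differ.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (higher is better)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two non-negative score vectors (lower is better).

    Both inputs are normalized to sum to 1; eps guards against log(0).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Under these assumptions, higher CLAP values indicate closer agreement between prompt and audio, while lower KL and FD values indicate closer agreement with the reference audio.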

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)
MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners (2025-06-23)
Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models (2025-06-18)
LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation (2025-06-13)
ViSAGe: Video-to-Spatial Audio Generation (2025-06-13)