Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Open generative models are vitally important for the community, allowing fine-tuning and serving as baselines when new models are presented. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FD_openl3 results (a measure of the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1 kHz.
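FD_openl3 is a Fréchet distance computed between Gaussian statistics (mean and covariance) of OpenL3 embeddings of real and generated audio. As a minimal sketch of the underlying metric (assuming embeddings have already been extracted; the function names here are illustrative, not from the paper's codebase):

```python
import numpy as np

def sqrtm_psd(A):
    """Matrix square root of a symmetric positive semi-definite matrix
    via eigendecomposition (eigenvalues clipped at zero for stability)."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1), N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    s1_half = sqrtm_psd(sigma1)
    # Tr((sigma1 sigma2)^(1/2)) == Tr((s1_half sigma2 s1_half)^(1/2)),
    # and the latter argument is symmetric PSD, so eigh applies.
    covmean = sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(
        np.sum((mu1 - mu2) ** 2)
        + np.trace(sigma1) + np.trace(sigma2) - 2.0 * np.trace(covmean)
    )

# 1-D sanity check: N(0, 1) vs N(1, 4) gives 1 + 1 + 4 - 2*2 = 2.
d = frechet_distance(np.array([0.0]), np.array([[1.0]]),
                     np.array([1.0]), np.array([[4.0]]))
```

Lower values indicate that generated audio is statistically closer to the reference set in embedding space.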
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Generation | AudioCaps | CLAP_LAION | 0.35 | Stable Audio Open |
| Audio Generation | AudioCaps | CLAP_MS | 0.34 | Stable Audio Open |
| Audio Generation | AudioCaps | FD_openl3 | 78.24 | Stable Audio Open |
| Audio Generation | AudioCaps | KL_passt | 2.14 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | CLAP_LAION | 0.48 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | CLAP_MS | 0.49 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FAD | 3.51 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FD | 36.42 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | FD_openl3 | 127.2 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | IS | 2.93 | Stable Audio Open |
| Text-to-Music Generation | MusicCaps | KL_passt | 1.32 | Stable Audio Open |
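The CLAP_LAION and CLAP_MS rows report text-audio alignment as the cosine similarity between text and audio embeddings from a CLAP model (the LAION and Microsoft variants, respectively). A minimal sketch of the scoring step, assuming embeddings have already been computed by a CLAP encoder (the function name is illustrative):

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Cosine similarity between a text embedding and an audio embedding.
    Higher values indicate the generated audio better matches the prompt."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(t @ a)

# Identical embeddings are perfectly aligned (score 1.0);
# orthogonal embeddings score 0.0.
aligned = clap_score(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0]))
orthogonal = clap_score(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In practice the reported value is this similarity averaged over all prompt/generation pairs in the evaluation set.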