Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro
Obtaining high-quality training data, especially captions, remains an open challenge for text-to-audio models. Although prior methods have leveraged *text-only language models* to augment and improve captions, such methods have limitations related to scale and to coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an *audio language model* to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named `AF-AudioSet`, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find that leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new *state-of-the-art*.
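At a high level, a pipeline of this kind generates several candidate captions per clip with an audio language model and keeps only candidates that agree well with the audio under a text-audio similarity model such as CLAP. The sketch below is a hypothetical illustration of that generate-then-filter loop, not the paper's actual implementation; `caption_model`, `clap_score`, `num_candidates`, and `threshold` are all placeholder assumptions.

```python
def synthesize_captions(audio_clips, caption_model, clap_score,
                        num_candidates=5, threshold=0.45):
    """Generate-then-filter caption synthesis (illustrative sketch).

    For each clip, sample several candidate captions from an audio
    language model, keep the candidate with the highest audio-text
    CLAP similarity, and discard clips whose best candidate falls
    below a quality threshold.
    """
    dataset = []
    for clip in audio_clips:
        # Sample multiple candidate captions for diversity.
        candidates = [caption_model(clip) for _ in range(num_candidates)]
        # Rank candidates by how well they match the audio.
        best = max(candidates, key=lambda cap: clap_score(clip, cap))
        # Keep only well-matched (clip, caption) pairs.
        if clap_score(clip, best) >= threshold:
            dataset.append((clip, best))
    return dataset
```

Any audio captioner and any audio-text similarity scorer can be plugged in for the two callables; the filtering threshold trades dataset size against caption accuracy.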
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Generation | AudioCaps | CLAP_LAION ↑ | 0.527 | Tango-AF&AC-FT-AC |
| Audio Generation | AudioCaps | FAD ↓ | 2.54 | Tango-AF&AC-FT-AC |
| Audio Generation | AudioCaps | FD ↓ | 17.19 | Tango-AF&AC-FT-AC |
| Audio Generation | AudioCaps | IS ↑ | 11.04 | Tango-AF&AC-FT-AC |
| Text-to-Music Generation | MusicCaps | CLAP_LAION ↑ | 0.51 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | CLAP_MS ↑ | 0.43 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | FAD ↓ | 2.21 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | FD ↓ | 22.69 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | FD_openl3 ↓ | 270.32 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | IS ↑ | 2.79 | TANGO-AF |
| Text-to-Music Generation | MusicCaps | KL_passt ↓ | 0.94 | TANGO-AF |
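Several of the distance metrics above (FAD, FD, FD_openl3) are Fréchet distances between Gaussians fit to embeddings of the reference and generated audio sets; they differ only in the embedding model used. As an illustration of that computation (not the paper's evaluation code), a minimal NumPy/SciPy sketch:

```python
import numpy as np
from scipy import linalg


def frechet_audio_distance(emb_ref, emb_gen):
    """Fréchet distance between Gaussians fit to two embedding sets.

    FAD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^(1/2)),
    where (mu_r, S_r) and (mu_g, S_g) are the mean and covariance of
    reference and generated embeddings, each of shape (n_samples, dim).
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    sigma_r = np.cov(emb_ref, rowvar=False)
    sigma_g = np.cov(emb_gen, rowvar=False)
    # Matrix square root of the covariance product; discard tiny
    # imaginary components that arise from numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

Identical embedding sets give a distance of (numerically) zero, and the score grows as the generated distribution drifts from the reference; lower is therefore better.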