Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley

2023-08-10 · Representation Learning · Text-to-Music Generation · Audio Generation · Text-to-Speech

Paper · PDF · Code (official) · Code

Abstract

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.
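The abstract describes a two-stage pipeline: a GPT-2 model translates a conditioning modality (e.g. text) into the "language of audio" (LOA) defined by a pretrained AudioMAE, and a latent diffusion model then generates audio conditioned on that LOA. The following is a minimal illustrative sketch of that data flow; all class and function names are hypothetical stand-ins, not the authors' actual API, and the bodies are toy placeholders rather than real models.

```python
# Hypothetical sketch of the two-stage AudioLDM 2 pipeline from the abstract.
# Names (AudioMAEEncoder, ARLoaPredictor, LatentDiffusionDecoder) are
# illustrative assumptions, and the computations are toy stand-ins.

class AudioMAEEncoder:
    """Stands in for the self-supervised AudioMAE that maps audio into the
    LOA representation (used during training to produce LOA targets)."""
    def encode(self, audio):
        # Toy stand-in: chunk the waveform into fixed-size "LOA" frames.
        return [audio[i:i + 4] for i in range(0, len(audio), 4)]

class ARLoaPredictor:
    """Stands in for the GPT-2 model that translates a conditioning
    modality (here, text) into a sequence of LOA frames."""
    def predict(self, text, num_frames=3):
        seed = sum(ord(c) for c in text)
        return [[(seed + f + d) % 7 for d in range(4)]
                for f in range(num_frames)]

class LatentDiffusionDecoder:
    """Stands in for the latent diffusion model that generates audio
    conditioned on LOA."""
    def generate(self, loa):
        # Toy stand-in: flatten the LOA frames back into a "waveform".
        return [x for frame in loa for x in frame]

def text_to_audio(text):
    loa = ARLoaPredictor().predict(text)           # stage 1: text -> LOA
    return LatentDiffusionDecoder().generate(loa)  # stage 2: LOA -> audio
```

Because both stages hand off through the shared LOA interface, the pretrained AudioMAE and the diffusion decoder can be reused across speech, music, and sound-effect generation, which is the unification the paper argues for.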

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Generation | AudioCaps | CLAP_LAION | 0.53 | AudioLDM2-large |
| Audio Generation | AudioCaps | CLAP_MS | 0.37 | AudioLDM2-large |
| Audio Generation | AudioCaps | FAD | 2.02 | AudioLDM2-large |
| Audio Generation | AudioCaps | FD | 26.18 | AudioLDM2-large |
| Audio Generation | AudioCaps | FD_openl3 | 158.04 | AudioLDM2-large |
| Audio Generation | AudioCaps | IS | 8.55 | AudioLDM2-large |
| Audio Generation | AudioCaps | KL_passt | 1.68 | AudioLDM2-large |
| Audio Generation | AudioCaps | CLAP_LAION | 0.243 | AudioLDM 2-AC-Large |
| Audio Generation | AudioCaps | FAD | 1.42 | AudioLDM 2-AC-Large |
| Text-to-Music Generation | MusicCaps | CLAP_LAION | 0.48 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | CLAP_MS | 0.47 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | FAD | 2.93 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | FD | 16.34 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | FD_openl3 | 190.16 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | IS | 2.59 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | KL_passt | 1 | AudioLDM2-large |
| Text-to-Music Generation | MusicCaps | FAD | 3.13 | AudioLDM 2-Full |
| Text-to-Music Generation | MusicCaps | KL_passt | 1.2 | AudioLDM 2-Full |
| Text-to-Music Generation | MusicCaps | FD_openl3 | 354.05 | AudioLDM2-music |
| Text-to-Music Generation | MusicCaps | KL_passt | 1.53 | AudioLDM2-music |

Related Papers

- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Hear Your Code Fail, Voice-Assisted Debugging for Python (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
- Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)