Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

Chang Li, Ruoyu Wang, Lijuan Liu, Jun Du, Yixuan Sun, Zilu Guo, Zhenrong Zhang, Yuan Jiang, Jianqing Gao, Feng Ma

2024-05-24 · Music Generation · Text-to-Music Generation

Paper · PDF · Code (official)

Abstract

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties of the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset on both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.
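The quality-aware training paradigm the abstract describes can be illustrated with a minimal sketch: score each training clip's audio quality, map the score to a coarse tag, and fold that tag into the conditioning text so quality becomes a controllable input at inference time. The bucket names, thresholds, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of quality-aware conditioning: tag each training
# caption with a coarse quality bucket derived from a pseudo-MOS audio
# quality score, so the model learns a controllable notion of quality.
# Thresholds and tag strings are assumptions for illustration only.

def quality_bucket(pseudo_mos: float) -> str:
    """Map a pseudo-MOS quality score (1-5 scale) to a coarse text tag."""
    if pseudo_mos >= 4.0:
        return "high quality"
    if pseudo_mos >= 3.0:
        return "medium quality"
    return "low quality"

def tag_caption(caption: str, pseudo_mos: float) -> str:
    """Prepend the quality tag so it conditions generation like any text."""
    return f"[{quality_bucket(pseudo_mos)}] {caption}"

print(tag_caption("calm piano with soft strings", 4.3))
# -> [high quality] calm piano with soft strings
```

At inference, one would then prompt with the "high quality" tag to steer generation toward the high-quality region of the quality-imbalanced training data.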

Results

Task                     | Dataset                | Metric   | Value | Model
Music Generation         | Song Describer Dataset | FAD VGG  | 1.01  | OpenMusic
Text-to-Music Generation | MusicCaps              | FAD      | 1.65  | OpenMusic (QA-MDT)
Text-to-Music Generation | MusicCaps              | IS       | 2.8   | OpenMusic (QA-MDT)
Text-to-Music Generation | MusicCaps              | KL_passt | 1.31  | OpenMusic (QA-MDT)
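The headline metric above, Fréchet Audio Distance (FAD), fits a Gaussian to embeddings (e.g. VGGish embeddings, hence "FAD VGG") of a reference set and a generated set, then measures the Fréchet distance between the two Gaussians; lower is better. A minimal sketch of that distance, using random vectors as stand-ins for real audio embeddings:

```python
import numpy as np

# Sketch of the Frechet distance underlying FAD:
#   d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
# computed here between Gaussians fitted to two embedding sets.
# The random embeddings are placeholders for real VGGish features.

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) via the eigenvalues of the product, which are
    # real and non-negative for PSD covariances (clip guards round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
ref = rng.normal(size=(500, 8))            # "reference" embeddings
gen = rng.normal(loc=0.5, size=(500, 8))   # "generated" embeddings, mean-shifted
print(frechet_distance(ref, gen))          # grows with the distribution gap
```

Identical distributions give a distance near zero, which is why the low FAD values in the table indicate generated audio whose embedding statistics closely match the reference sets.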

Related Papers

WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling (2025-07-14)
MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation (2025-07-08)
TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure (2025-06-29)
Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
Let Your Video Listen to Your Music! (2025-06-23)
MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners (2025-06-23)
Benchmarking Music Generation Models and Metrics via Human Preference Studies (2025-06-23)
AI-Generated Song Detection via Lyrics Transcripts (2025-06-23)