Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FLUX that Plays Music

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang

Published: 2024-09-01 · Tasks: Music Generation, Text-to-Music Generation
Links: Paper · PDF · Code (official)

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Building on the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention blocks to the double text-music stream, followed by a stack of single music-stream blocks for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantics and to allow inference flexibility. Coarse textual information, in conjunction with time-step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as input. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at https://github.com/feizc/FluxMusic.
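The "rectified flow training" the abstract refers to can be sketched in a few lines: data and Gaussian noise are mixed along a straight line, and the network is regressed onto the constant velocity between them. The sketch below is illustrative only; the function names (`rf_interpolate`, `rf_training_step`) and the NumPy stand-in for a real model are assumptions, not code from the FluxMusic repository.

```python
import numpy as np

def rf_interpolate(x0, noise, t):
    """Straight-line interpolant between data x0 (at t=0) and noise (at t=1)."""
    return (1.0 - t) * x0 + t * noise

def rf_training_step(model, x0, cond, rng):
    """One rectified-flow step: sample t and noise, then regress the
    constant velocity field v = noise - x0 with an MSE loss."""
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(size=(x0.shape[0], 1))   # one timestep per sample
    x_t = rf_interpolate(x0, noise, t)
    target = noise - x0                      # velocity the model should predict
    pred = model(x_t, t, cond)               # model(latents, timestep, text condition)
    return float(np.mean((pred - target) ** 2))
```

At sampling time the same velocity field is integrated from noise back to data, which is why the straight-line formulation tends to need fewer integration steps than curved diffusion trajectories.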

Results

Task                     | Dataset   | Metric   | Value | Model
------------------------ | --------- | -------- | ----- | ---------
Text-to-Music Generation | MusicCaps | FAD      | 1.43  | FLUXMusic
Text-to-Music Generation | MusicCaps | IS       | 2.98  | FLUXMusic
Text-to-Music Generation | MusicCaps | KL_passt | 1.25  | FLUXMusic
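FAD (Fréchet Audio Distance), the headline metric above, fits a Gaussian to embeddings of real and generated audio and measures the Fréchet distance between the two fits (lower is better). A minimal NumPy sketch of that formula follows; it assumes the embeddings have already been extracted by some audio encoder, and it computes the trace of the matrix square root via eigenvalues rather than a dedicated `sqrtm` routine.

```python
import numpy as np

def frechet_audio_distance(emb_real, emb_gen):
    """Frechet distance between Gaussian fits of two embedding sets:
    ||mu_r - mu_g||^2 + Tr(S_r) + Tr(S_g) - 2 Tr((S_r S_g)^{1/2})."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # The product of two PSD matrices has real, nonnegative eigenvalues,
    # so Tr((cov_r cov_g)^{1/2}) = sum of their square roots.
    eigs = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigs.real, 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Identical distributions give a distance near zero, and a pure mean shift contributes the squared shift norm, which matches the intuition that FAD penalizes both location and spread mismatches.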

Related Papers

- WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling (2025-07-14)
- MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation (2025-07-08)
- TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure (2025-06-29)
- Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
- Let Your Video Listen to Your Music! (2025-06-23)
- MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners (2025-06-23)
- Benchmarking Music Generation Models and Metrics via Human Preference Studies (2025-06-23)
- AI-Generated Song Detection via Lyrics Transcripts (2025-06-23)