Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FLUX that Plays Music

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang

Published: 2024-09-01 · Tasks: Music Generation, Text-to-Music Generation
Links: Paper · PDF · Code (official)

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Building on the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of the mel-spectrogram. The model first applies a sequence of independent attention blocks to the double text-music stream, followed by a stack of single music-stream blocks for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantics and to allow inference flexibility. Coarse textual information, in conjunction with time-step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as input. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at https://github.com/feizc/FluxMusic.
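The "rectified flow training" the abstract refers to can be sketched in a few lines: data and Gaussian noise are mixed along a straight line, and the network is regressed onto the constant velocity between them. The sketch below is illustrative only; the function names (`rf_interpolate`, `rf_training_step`) and the NumPy stand-in for a real model are assumptions, not code from the FluxMusic repository.

```python
import numpy as np

def rf_interpolate(x0, noise, t):
    """Straight-line interpolant between data x0 (at t=0) and noise (at t=1)."""
    return (1.0 - t) * x0 + t * noise

def rf_training_step(model, x0, cond, rng):
    """One rectified-flow step: sample t and noise, then regress the
    constant velocity field v = noise - x0 with an MSE loss."""
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(size=(x0.shape[0], 1))   # one timestep per sample
    x_t = rf_interpolate(x0, noise, t)
    target = noise - x0                      # velocity the model should predict
    pred = model(x_t, t, cond)               # model(latents, timestep, text condition)
    return float(np.mean((pred - target) ** 2))
```

At sampling time the same velocity field is integrated from noise back to data, which is why the straight-line formulation tends to need fewer integration steps than curved diffusion trajectories.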

Results

Task                     | Dataset   | Metric   | Value | Model
------------------------ | --------- | -------- | ----- | ---------
Text-to-Music Generation | MusicCaps | FAD      | 1.43  | FLUXMusic
Text-to-Music Generation | MusicCaps | IS       | 2.98  | FLUXMusic
Text-to-Music Generation | MusicCaps | KL_passt | 1.25  | FLUXMusic
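FAD (Fréchet Audio Distance), the headline metric above, fits a Gaussian to embeddings of real and generated audio and measures the Fréchet distance between the two fits (lower is better). A minimal NumPy sketch of that formula follows; it assumes the embeddings have already been extracted by some audio encoder, and it computes the trace of the matrix square root via eigenvalues rather than a dedicated `sqrtm` routine.

```python
import numpy as np

def frechet_audio_distance(emb_real, emb_gen):
    """Frechet distance between Gaussian fits of two embedding sets:
    ||mu_r - mu_g||^2 + Tr(S_r) + Tr(S_g) - 2 Tr((S_r S_g)^{1/2})."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # The product of two PSD matrices has real, nonnegative eigenvalues,
    # so Tr((cov_r cov_g)^{1/2}) = sum of their square roots.
    eigs = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigs.real, 0.0, None)))
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Identical distributions give a distance near zero, and a pure mean shift contributes the squared shift norm, which matches the intuition that FAD penalizes both location and spread mismatches.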

Related Papers

- WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling (2025-07-14)
- MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation (2025-07-08)
- TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure (2025-06-29)
- Exploring Adapter Design Tradeoffs for Low Resource Music Generation (2025-06-26)
- Let Your Video Listen to Your Music! (2025-06-23)
- MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners (2025-06-23)
- Benchmarking Music Generation Models and Metrics via Human Preference Studies (2025-06-23)
- AI-Generated Song Detection via Lyrics Transcripts (2025-06-23)