Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

2023-04-24 · Audio Generation

Abstract

The immense scale of recent large language models (LLMs) enables many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt one such instruction-tuned LLM, Flan-T5, as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. Prior work on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model such as T5. Consequently, our latent diffusion model (LDM)-based approach, TANGO, outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on the AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas prior methods take a random mix.
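The abstract does not spell out how pressure level-based mixing works, so the following is only a minimal sketch of one common formulation, the one used in between-class learning for sounds. It is an assumption that TANGO follows this form. The function names (`pressure_level_db`, `mixing_weight`, `mix_by_pressure_level`) are hypothetical and chosen for illustration. The idea is to weight each clip by its relative loudness so that neither clip dominates the mixture, in contrast to a random mix.

```python
import math


def pressure_level_db(signal):
    """Relative pressure level in dB, computed from the mean-square amplitude."""
    mean_square = sum(s * s for s in signal) / len(signal)
    # Floor the value to avoid log10(0) on silent clips (illustrative choice).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def mixing_weight(x1, x2):
    """Weight p for x1; the louder x1 is relative to x2, the smaller p becomes."""
    g1 = pressure_level_db(x1)
    g2 = pressure_level_db(x2)
    return 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0))


def mix_by_pressure_level(x1, x2):
    """Mix two equal-length clips so both contribute comparable loudness."""
    p = mixing_weight(x1, x2)
    # Normalise so the mixture's energy stays on the scale of the inputs.
    norm = math.sqrt(p * p + (1.0 - p) ** 2)
    return [(p * a + (1.0 - p) * b) / norm for a, b in zip(x1, x2)]


# Hypothetical toy clips: x1 has twice the amplitude of x2.
loud = [2.0, -2.0, 2.0, -2.0]
quiet = [1.0, 1.0, -1.0, -1.0]
mixed = mix_by_pressure_level(loud, quiet)
```

In this sketch the louder clip receives the smaller weight, so after weighting both clips enter the mixture at a similar amplitude. A random mix, by contrast, would let a loud clip mask a quiet one.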

Results

Task	Dataset	Metric	Value	Model
Audio Generation	AudioCaps	FAD	1.59	TANGO
Audio Generation	AudioCaps	FD	24.52	TANGO

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)
LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation (2025-06-13)
ViSAGe: Video-to-Spatial Audio Generation (2025-06-13)
BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation (2025-06-11)
A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations (2025-06-06)