TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Text-to-Audio Generation using Instruction-Tuned LLM and L...

Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Soujanya Poria

2023-04-24Audio Generation
PaperPDFCode(official)

Abstract

The immense scale of the recent large language models (LLM) allows many interesting properties, such as, instruction- and chain-of-thought-based fine-tuning, that has significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt such an instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation -- a task where the goal is to generate an audio from its textual description. The prior works on TTA either pre-trained a joint text-audio encoder or used a non-instruction-tuned model, such as, T5. Consequently, our latent diffusion model (LDM)-based approach TANGO outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest on AudioCaps test set, despite training the LDM on a 63 times smaller dataset and keeping the text encoder frozen. This improvement might also be attributed to the adoption of audio pressure level-based sound mixing for training set augmentation, whereas the prior methods take a random mix.

Results

TaskDatasetMetricValueModel
Audio GenerationAudioCapsFAD1.59TANGO
Audio GenerationAudioCapsFD24.52TANGO

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation2025-07-11ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing2025-06-26Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance2025-06-26Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation2025-06-24LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation2025-06-13ViSAGe: Video-to-Spatial Audio Generation2025-06-13BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation2025-06-11A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations2025-06-06