Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li

2024-01-02 · Style Transfer · Audio Generation · Cross-Modal Alignment
Paper · PDF · Code (official)

Abstract

Recent advances in diffusion models and large language models (LLMs) have significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning AIGC application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system that adapts T2I model frameworks to the TTA task, effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous T2I studies recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object bindings, whereas a comparable evaluation is lacking in prior TTA work. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion's superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations. Our implementation and demos are available at https://auffusion.github.io.
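The core idea of adapting a T2I diffusion framework to audio is to run the usual text-conditioned denoising loop over a mel-spectrogram (treated as a one-channel image) and then pass the result to a vocoder. The sketch below is purely illustrative and makes no claim about Auffusion's actual architecture: `toy_denoiser` is a hypothetical stand-in for the trained U-Net, and the shapes, schedule, and conditioning are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, t, text_emb):
    # Hypothetical stand-in for the trained U-Net: predicts the noise in the
    # latent x given the timestep t and a text embedding (a linear mix here,
    # purely for illustration).
    return 0.1 * x + 0.01 * t * np.tanh(text_emb.mean())

def sample_spectrogram(text_emb, steps=50, shape=(80, 256)):
    """DDPM-style ancestral sampling over a mel-spectrogram 'latent'."""
    betas = np.linspace(1e-4, 0.02, steps)      # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, t, text_emb)      # predicted noise, conditioned on text
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                               # add noise except at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x  # in a real system this mel-spectrogram would go to a vocoder

mel = sample_spectrogram(text_emb=rng.standard_normal(512))
```

The cross-modal alignment the abstract discusses lives in the denoiser's cross-attention between spectrogram patches and text tokens, which is what the paper's cross-attention map visualizations inspect.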

Results

Task              Dataset    Metric  Value  Model
Audio Generation  AudioCaps  FAD     1.63   Auffusion
Audio Generation  AudioCaps  FD      21.99  Auffusion
Audio Generation  AudioCaps  FAD     1.76   Auffusion-Full
Audio Generation  AudioCaps  FD      23.08  Auffusion-Full
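Both metrics in the table are Fréchet distances between Gaussians fitted to embeddings of generated and reference audio (FAD conventionally uses VGGish embeddings; lower is better). Given the per-set embedding means and covariances, the distance can be computed as below; the function name and argument order are illustrative, not from any particular evaluation toolkit.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

For identical distributions the distance is zero; shifting one mean by a unit vector in d dimensions adds d to the score, which is a quick sanity check when wiring up an evaluation.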

Related Papers

- Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
- CATVis: Context-Aware Thought Visualization (2025-07-15)
- Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
- Transferring Styles for Reduced Texture Bias and Improved Robustness in Semantic Segmentation Networks (2025-07-14)
- FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
- Skywork-R1V3 Technical Report (2025-07-08)
- RSRefSeg 2: Decoupling Referring Remote Sensing Image Segmentation with Foundation Models (2025-07-08)