Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching

Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, RuiQi Li, Zhou Zhao

2024-06-01

Video-to-Sound Generation, Audio Generation

Abstract

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with a guided vector field, our model can generate decent audio in a few sampling steps, or even a single step. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io.
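The straight-path regression and ODE sampling described in the abstract can be illustrated with a toy sketch. This is not the Frieren implementation: it omits the video conditioning, the transformer estimator, reflow, and distillation, and the names `straight_path_pair`, `euler_sample`, and `make_point_target_field` are hypothetical. It assumes the common rectified-flow convention x_t = (1 - t) x0 + t x1 with regression target x1 - x0, and replaces the learned network with a closed-form field for a single target point, so the effect of straight paths on step count is easy to see.

```python
import numpy as np

def straight_path_pair(x0, x1, t):
    """Point on the straight noise-to-data path and its regression target.

    Assumed convention: x_t = (1 - t) * x0 + t * x1, target velocity x1 - x0.
    """
    x_t = (1.0 - t) * x0 + t * x1
    target_v = x1 - x0
    return x_t, target_v

def euler_sample(vector_field, x0, num_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field is well defined
        x = x + dt * vector_field(x, t)
    return x

def make_point_target_field(x1):
    """Toy stand-in for a trained estimator (an assumption for illustration).

    When the data distribution is the single point x1, the straight-path
    velocity at (x, t) is (x1 - x) / (1 - t).
    """
    def field(x, t):
        return (x1 - x) / (1.0 - t)
    return field

rng = np.random.default_rng(0)
x1 = np.array([1.0, -2.0, 0.5])   # stand-in for a spectrogram latent
x0 = rng.standard_normal(3)       # Gaussian noise sample
field = make_point_target_field(x1)

one_step = euler_sample(field, x0, num_steps=1)
many_steps = euler_sample(field, x0, num_steps=25)
print(np.allclose(one_step, x1), np.allclose(many_steps, x1))
```

In this toy setting the trajectory is exactly straight, so a single Euler step already lands on the target. With a learned field the paths are only approximately straight, which is the gap that the reflow and distillation steps mentioned in the abstract are meant to narrow.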

Results

Task	Dataset	Metric	Value	Model
Audio Generation	VGG-Sound	FAD	1.32	Frieren
Audio Generation	VGG-Sound	FD	12.26	Frieren

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025-06-26)
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance (2025-06-26)
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation (2025-06-24)
LiLAC: A Lightweight Latent ControlNet for Musical Audio Generation (2025-06-13)
ViSAGe: Video-to-Spatial Audio Generation (2025-06-13)
BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation (2025-06-11)
A Survey of Automatic Evaluation Methods on Text, Visual and Speech Generations (2025-06-06)