LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Yu Cheng, Fajie Yuan

2025-03-18 | Video Reconstruction, Video Generation

Abstract

Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model requires up to 50x fewer FLOPs and runs inference up to 44x faster while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE.
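
As a rough illustration of two building blocks named in the abstract, the sketch below shows a non-overlapping patch split and a one-level wavelet transform in plain Python. It is not the LeanVAE implementation: the abstract does not say which wavelet is used, so the Haar wavelet here is an assumption, and the function names `patchify`, `haar_forward`, and `haar_inverse` are hypothetical.

```python
import math


def patchify(frame, p):
    """Split a 2D frame (list of rows) into non-overlapping p x p patches.

    Patches are returned in row-major order, each flattened to a list.
    Because patches do not overlap, every pixel is visited exactly once.
    """
    h, w = len(frame), len(frame[0])
    if h % p or w % p:
        raise ValueError("frame height and width must be divisible by p")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append(
                [frame[top + i][left + j] for i in range(p) for j in range(p)]
            )
    return patches


def haar_forward(signal):
    """One level of the orthonormal Haar transform (assumed wavelet).

    Returns (approx, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, recovering the original samples."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out
```

For example, `patchify` applied to a 4x4 frame with `p=2` yields four patches of four values each, and `haar_inverse(*haar_forward(x))` reproduces `x` up to floating-point error. The detail coefficient is zero wherever two neighbouring samples are equal, which is one intuition for why a wavelet representation can be convenient for compression.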

Results

Task	Dataset	Metric	Value	Model
Video	UCF-101	FVD16	164.45	Latte + LeanVAE
Video	Sky Time-lapse	FVD16	49.59	Latte + LeanVAE
Video Generation	UCF-101	FVD16	164.45	Latte + LeanVAE
Video Generation	Sky Time-lapse	FVD16	49.59	Latte + LeanVAE
