VIDM: Video Implicit Diffusion Models

Kangfu Mei, Vishal M. Patel

2022-12-01 | Video Generation

Abstract

Diffusion models have emerged as a powerful generative method for synthesizing high-quality and diverse sets of images. In this paper, we propose a video generation method based on diffusion models, where the effects of motion are modeled as an implicit condition, i.e., plausible video motions can be sampled according to the latent features of the frames. We improve the quality of the generated videos with several strategies: sampling space truncation, a robustness penalty, and positional group normalization. Experiments are conducted on datasets of videos with different resolutions and different numbers of frames. Results show that the proposed method outperforms state-of-the-art generative adversarial network-based methods by a significant margin in terms of FVD scores as well as perceptible visual quality.

Results

Task	Dataset	Metric	Value	Model
Video Generation	UCF-101	FVD128	1531.9	VIDM (256x256, unconditional)
Video Generation	UCF-101	FVD16	294.7	VIDM (256x256, unconditional)
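
The FVD values in the table are Fréchet distances between feature statistics of real and generated videos, so lower is better. The following is a minimal sketch of that distance only, not the paper's evaluation code. It assumes features have already been extracted by a pretrained video network (FVD conventionally uses I3D; the extractor is not shown here), and the function name `frechet_distance` and the array shapes are illustrative assumptions.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are hypothetical (n_samples, n_features) arrays of
    precomputed video features; the feature extractor is out of scope.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite inputs. Clip small numerical noise below zero.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of approximately zero, and shifting every feature by a constant vector adds the squared norm of that shift. Reported scores also depend on the extractor, the clip length, and the number of sampled videos, so values from this sketch are not directly comparable to the table.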
