Video-GPT via Next Clip Diffusion

Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang

2025-05-18Denoising Video Retrieval Video Prediction Prediction Video Object Segmentation Video Classification Video Generation

Paper PDF Code

Abstract

GPT has shown its remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such details. Motivated by this fact, we propose a concise Video-GPT in this paper by treating video as new language for visual world modeling. By analogy to next token prediction in GPT, we introduce a novel next clip diffusion paradigm for pretraining Video-GPT. Different from the previous works, this distinct paradigm allows Video-GPT to tackle both short-term generation and long-term prediction, by autoregressively denoising the noisy clip according to the clean clips in the history. Extensive experiments show our Video-GPT achieves the state-of-the-art performance on video prediction, which is the key factor towards world modeling (Physics-IQ Benchmark: Video-GPT 34.97 vs. Kling 23.64 vs. Wan 20.89). Moreover, it can be well adapted on 6 mainstream video tasks in both video generation and understanding, showing its great generalization capacity in downstream. The project page is at https://zhuangshaobin.github.io/Video-GPT.github.io/.

Results

Task	Dataset	Metric	Value	Model
Video	UCF-101	FVD16	53	Video-GPT
Video Generation	UCF-101	FVD16	53	Video-GPT

Related Papers

Multi-Strategy Improved Snake Optimizer Accelerated CNN-LSTM-Attention-Adaboost for Trajectory Prediction2025-07-21 SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction2025-07-21 fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting2025-07-17 Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models2025-07-17 World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving2025-07-17 Leveraging Pre-Trained Visual Models for AI-Generated Video Detection2025-07-17 Taming Diffusion Transformer for Real-Time Mobile Video Generation2025-07-17 LoViC: Efficient Long Video Generation with Context Compression2025-07-17