CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

2022-05-29

Text-to-Video Generation, Video Generation

Abstract

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.

Results

Task	Dataset	Metric	Value	Model
Video Generation	UCF-101	FVD16	305	CogVideo (128x128, class-conditional)
Video Generation	UCF-101	Inception Score	51.11	CogVideo (128x128, class-conditional)
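
For readers unfamiliar with the Inception Score column, the following is a minimal sketch of how that metric is generally defined: the exponential of the mean KL divergence between each sample's predicted class distribution and the marginal distribution over all samples. It is not the evaluation code behind the numbers above. The function name `inception_score` and the input format (one list of class probabilities per generated video, produced by some pretrained classifier) are assumptions made for illustration.

```python
import math


def inception_score(probs):
    """Inception Score from per-sample class probabilities.

    probs: list of lists; probs[i][j] is the (hypothetical) classifier's
    probability that generated sample i belongs to class j. Each row is
    assumed to sum to 1.
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Sum of KL(p(y|x) || p(y)) over samples; terms with p = 0 contribute 0.
    kl_sum = 0.0
    for p in probs:
        for j in range(k):
            if p[j] > 0:
                kl_sum += p[j] * math.log(p[j] / marginal[j])
    return math.exp(kl_sum / n)


# Confident predictions spread evenly over two classes give the maximum score.
confident = [[1.0, 0.0], [0.0, 1.0]]
# Uninformative predictions give the minimum score.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident), inception_score(uniform))
```

Under this definition the score ranges from 1 up to the number of classes, so for a 101-class dataset such as UCF-101 the upper bound is 101.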

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
Scaling RL to Long Videos (2025-07-10)
Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)