MAGVIT: Masked Generative Video Transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

2022-12-10 | CVPR 2023 | Text-to-Video Generation, Video Prediction, Multi-Task Learning, Video Generation


Abstract

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.

Results

Task	Dataset	Metric	Value	Model
Video Prediction	Kinetics-600 12 frames, 64x64	Cond	5	MAGVIT (-L-FP)
Video Prediction	Kinetics-600 12 frames, 64x64	Pred	11	MAGVIT (-L-FP)
Video Prediction	Kinetics-600 12 frames, 64x64	Cond	5	MAGVIT (-B-FP)
Video Prediction	Kinetics-600 12 frames, 64x64	Pred	11	MAGVIT (-B-FP)
Video Prediction	Something-Something V2	FVD	28.5	MAGVIT
Video Generation	Kinetics-600 12 frames, 64x64	FVD	9.9	MAGVIT
Video Generation	UCF-101	FVD16	265	MAGVIT (AR)
Video Generation	BAIR Robot Pushing	Cond	1	MAGVIT
Video Generation	BAIR Robot Pushing	FVD score	62	MAGVIT
Video Generation	BAIR Robot Pushing	Pred	15	MAGVIT
Video Generation	BAIR Robot Pushing	Train	15	MAGVIT
Text-to-Video Generation	Something-Something V2	FVD	79.1	MAGVIT
