


MAGVIT: Masked Generative Video Transformer

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

Published: 2022-12-10
Venue: CVPR 2023
Tasks: Text-to-Video Generation, Video Prediction, Multi-Task Learning, Video Generation

Abstract

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
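The two ideas named in the abstract, quantizing a video into a grid of spatial-temporal tokens and training on masked tokens, can be illustrated with a small sketch. Everything concrete here is an assumption for illustration only: the clip size, the downsampling strides, the vocabulary size, the `MASK_ID` value, the helper names, and the uniform random masking. It does not reproduce MAGVIT's tokenizer or its embedding method for multi-task masking.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def token_grid_shape(frames, height, width, t_stride, s_stride):
    """Token grid shape when a 3D tokenizer downsamples time and space by the given strides."""
    return (frames // t_stride, height // s_stride, width // s_stride)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked sequence and a dict {position: original token},
    which is what a masked-token model would be trained to predict.
    """
    n_mask = round(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return masked, targets


rng = random.Random(0)

# Assumed example: a 16-frame 128x128 clip, downsampled 4x in time and 8x in space.
shape = token_grid_shape(frames=16, height=128, width=128, t_stride=4, s_stride=8)
n_tokens = shape[0] * shape[1] * shape[2]

# Stand-in for tokenizer output: random ids from an assumed 1024-entry codebook.
tokens = [rng.randrange(1024) for _ in range(n_tokens)]

masked, targets = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
print(shape, n_tokens, masked.count(MASK_ID))
```

In this simplified view, different generation tasks would differ only in which positions start out masked (for example, all future frames for prediction), which is one way to read the abstract's claim that a single model can serve several tasks.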

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-600 12 frames, 64x64 | FVD | 9.9 | MAGVIT |
| Video | UCF-101 | FVD16 | 265 | MAGVIT (AR) |
| Video | BAIR Robot Pushing | Cond | 1 | MAGVIT |
| Video | BAIR Robot Pushing | FVD score | 62 | MAGVIT |
| Video | BAIR Robot Pushing | Pred | 15 | MAGVIT |
| Video | BAIR Robot Pushing | Train | 15 | MAGVIT |
| Video | Kinetics-600 12 frames, 64x64 | Cond | 5 | MAGVIT (-L-FP) |
| Video | Kinetics-600 12 frames, 64x64 | Pred | 11 | MAGVIT (-L-FP) |
| Video | Kinetics-600 12 frames, 64x64 | Cond | 5 | MAGVIT (-B-FP) |
| Video | Kinetics-600 12 frames, 64x64 | Pred | 11 | MAGVIT (-B-FP) |
| Video | Something-Something V2 | FVD | 28.5 | MAGVIT |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Cond | 5 | MAGVIT (-L-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Pred | 11 | MAGVIT (-L-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Cond | 5 | MAGVIT (-B-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Pred | 11 | MAGVIT (-B-FP) |
| Video Prediction | Something-Something V2 | FVD | 28.5 | MAGVIT |
| Video Generation | Kinetics-600 12 frames, 64x64 | FVD | 9.9 | MAGVIT |
| Video Generation | UCF-101 | FVD16 | 265 | MAGVIT (AR) |
| Video Generation | BAIR Robot Pushing | Cond | 1 | MAGVIT |
| Video Generation | BAIR Robot Pushing | FVD score | 62 | MAGVIT |
| Video Generation | BAIR Robot Pushing | Pred | 15 | MAGVIT |
| Video Generation | BAIR Robot Pushing | Train | 15 | MAGVIT |
| Text-to-Video Generation | Something-Something V2 | FVD | 79.1 | MAGVIT |

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
Robust-Multi-Task Gradient Boosting (2025-07-15)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)