Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and achieves the best published FVD on three video generation benchmarks, including the challenging Kinetics-600; (ii) MAGVIT's inference is two orders of magnitude faster than that of diffusion models and 60x faster than that of autoregressive models; and (iii) a single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
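To make the multi-task idea concrete, here is a minimal sketch of how several generation tasks could be expressed as different masks over one grid of video tokens. It is not MAGVIT's embedding method, which the abstract does not spell out; the grid size, the `MASK_ID` value, the task names, and the helper functions `task_mask` and `apply_mask` are all assumptions made for illustration.

```python
# Illustrative sketch only: MASK_ID, the task names, and both helpers are
# hypothetical and are not taken from the MAGVIT code base.
from itertools import product

MASK_ID = -1  # assumed id reserved for positions the model must generate


def task_mask(task, t, h, w, cond_frames=1):
    """Return the set of (frame, row, col) token positions to generate."""
    positions = set(product(range(t), range(h), range(w)))
    if task == "frame_prediction":
        # Keep the first `cond_frames` token frames, generate the rest.
        return {p for p in positions if p[0] >= cond_frames}
    if task == "frame_interpolation":
        # Keep the first and last token frames, generate those in between.
        return {p for p in positions if 0 < p[0] < t - 1}
    if task == "central_outpainting":
        # Keep a central spatial block in every frame, generate the border.
        def in_center(p):
            return h // 4 <= p[1] < h - h // 4 and w // 4 <= p[2] < w - w // 4
        return {p for p in positions if not in_center(p)}
    if task == "generation":
        # Nothing is given: every token is generated.
        return positions
    raise ValueError(f"unknown task: {task}")


def apply_mask(tokens, mask):
    """Replace the token id at every masked position with MASK_ID."""
    return {p: (MASK_ID if p in mask else tok) for p, tok in tokens.items()}


if __name__ == "__main__":
    t, h, w = 4, 4, 4  # assumed token grid, far smaller than a real one
    tokens = {p: 7 for p in product(range(t), range(h), range(w))}
    for task in ("frame_prediction", "frame_interpolation",
                 "central_outpainting", "generation"):
        masked = apply_mask(tokens, task_mask(task, t, h, w))
        n_masked = sum(1 for v in masked.values() if v == MASK_ID)
        print(task, n_masked, "of", len(masked), "tokens to generate")
```

Under this framing, one transformer trained to fill in masked tokens can serve every task, because only the mask changes. Note that the frames here are token frames: a 3D tokenizer compresses time as well as space, so they need not correspond one-to-one to video frames.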
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Prediction | Kinetics-600 12 frames, 64x64 | Cond (conditioning frames) | 5 | MAGVIT (-L-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Pred (predicted frames) | 11 | MAGVIT (-L-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Cond (conditioning frames) | 5 | MAGVIT (-B-FP) |
| Video Prediction | Kinetics-600 12 frames, 64x64 | Pred (predicted frames) | 11 | MAGVIT (-B-FP) |
| Video Prediction | Something-Something V2 | FVD | 28.5 | MAGVIT |
| Video Generation | Kinetics-600 12 frames, 64x64 | FVD | 9.9 | MAGVIT |
| Video Generation | UCF-101 | FVD16 | 265 | MAGVIT (AR) |
| Video Generation | BAIR Robot Pushing | Cond (conditioning frames) | 1 | MAGVIT |
| Video Generation | BAIR Robot Pushing | FVD score | 62 | MAGVIT |
| Video Generation | BAIR Robot Pushing | Pred (predicted frames) | 15 | MAGVIT |
| Video Generation | BAIR Robot Pushing | Train (training frames) | 15 | MAGVIT |
| Text-to-Video Generation | Something-Something V2 | FVD | 79.1 | MAGVIT |