Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

2022-11-23 · CVPR 2023 · Tasks: Text-to-Video Generation, Video Prediction, Video Generation

Paper · PDF · Code (official)

Abstract

Generating a video given the first several static frames is challenging, as it requires anticipating plausible future frames with temporal coherence. Beyond video prediction, the ability to rewind from the last frame or to infill between the head and tail is also crucial, but these settings have rarely been explored for video completion. Since the hints of just a few frames admit many different outcomes, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
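The unifying idea in the abstract is that prediction, rewind, and infilling differ only in which frames are visible to the model. A minimal sketch of such a frame-level keep-mask is below; the function name, the parameter `k` (number of visible hint frames per side), and the mode names are illustrative assumptions, not the paper's exact masking scheme.

```python
import numpy as np

def tvc_mask(num_frames: int, mode: str, k: int = 2) -> np.ndarray:
    """Build a boolean keep-mask over frames for one TVC case.

    True  = frame is visible (conditioning hint frame)
    False = frame is masked and must be generated.
    Illustrative sketch only; `k` hint frames per side is an assumption.
    """
    keep = np.zeros(num_frames, dtype=bool)
    if mode == "prediction":    # first k frames visible, generate forwards
        keep[:k] = True
    elif mode == "rewind":      # last k frames visible, generate backwards
        keep[-k:] = True
    elif mode == "infilling":   # head and tail visible, fill the middle
        keep[:k] = True
        keep[-k:] = True
    else:
        raise ValueError(f"unknown mode: {mode}")
    return keep
```

A single model trained against masks drawn from all three modes can then serve every TVC case at inference by supplying the matching mask.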

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Generation | UCF-101 | FVD16 | 328 | MMVG (128x128, class-conditional) |
| Video Generation | UCF-101 | Inception Score | 73.7 | MMVG (128x128, class-conditional) |
| Video Generation | UCF-101 | FVD16 | 395 | MMVG (128x128, unconditional) |
| Video Generation | UCF-101 | Inception Score | 58.3 | MMVG (128x128, unconditional) |
| Video Prediction | BAIR Robot Pushing | FVD | 85.2 | MMVG |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2644 | MMVG |
| Text-to-Video Generation | MSR-VTT | FID | 23.4 | MMVG |
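The CLIPSIM metric reported on MSR-VTT is the average CLIP similarity between the text prompt and the generated frames. A minimal numpy sketch of that averaging step is below, assuming the frame and text embeddings have already been produced by a pretrained CLIP model (which is not shown here).

```python
import numpy as np

def clipsim(frame_embs: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between each frame embedding and the
    text embedding. frame_embs: (num_frames, dim); text_emb: (dim,).
    Illustrative; real evaluation uses actual CLIP encoders.
    """
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float((f @ t).mean())
```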

Related Papers

- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
- Scaling RL to Long Videos (2025-07-10)
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)