TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Towards End-to-End Generative Modeling of Long Videos with...

Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers

Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong

2023-03-20CVPR 2023 1Video Generation
PaperPDFCode(official)

Abstract

Autoregressive transformers have shown remarkable success in video generation. However, the transformers are prohibited from directly learning the long-term dependency in videos due to the quadratic complexity of self-attention, and inherently suffering from slow inference time and error propagation due to the autoregressive process. In this paper, we propose Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves a linear time complexity in both encoding and decoding, by projecting observable context tokens into a fixed number of latent tokens and conditioning them to decode the masked tokens through the cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over the autoregressive Transformers for generating moderately long videos in both quality and speed. Videos and code are available at https://sites.google.com/view/mebt-cvpr2023 .

Results

TaskDatasetMetricValueModel
VideoUCF-101FVD128968MeBT (128x128, unconditional)
VideoUCF-101FVD16438MeBT (128x128, unconditional)
VideoUCF-101Inception Score65.93MeBT (128x128, unconditional)
Video GenerationUCF-101FVD128968MeBT (128x128, unconditional)
Video GenerationUCF-101FVD16438MeBT (128x128, unconditional)
Video GenerationUCF-101Inception Score65.93MeBT (128x128, unconditional)

Related Papers

World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving2025-07-17Leveraging Pre-Trained Visual Models for AI-Generated Video Detection2025-07-17Taming Diffusion Transformer for Real-Time Mobile Video Generation2025-07-17LoViC: Efficient Long Video Generation with Context Compression2025-07-17$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting2025-07-12Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective2025-07-11Scaling RL to Long Videos2025-07-10Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions2025-07-10