TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Long-Context Autoregressive Video Modeling with Next-Frame...

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

YuChao Gu, Weijia Mao, Mike Zheng Shou

2025-03-25Text GenerationVideo Generation
PaperPDFCode(official)

Abstract

Long-context video modeling is essential for enabling generative models to function as world simulators, as they must maintain temporal coherence over extended time spans. However, most existing models are trained on short clips, limiting their ability to capture long-range dependencies, even with test-time extrapolation. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. To support exploring efficient long-context video modeling, we first establish a strong autoregressive baseline called Frame AutoRegressive (FAR). FAR models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models. Based on this baseline, we observe context redundancy in video autoregression. Nearby frames are critical for maintaining temporal consistency, whereas distant frames primarily serve as context memory. To eliminate this redundancy, we propose the long short-term context modeling using asymmetric patchify kernels, which apply large kernels to distant frames to reduce redundant tokens, and standard kernels to local frames to preserve fine-grained detail. This significantly reduces the training cost of long videos. Our method achieves state-of-the-art results on both short and long video generation, providing an effective baseline for long-context autoregressive video modeling.

Results

TaskDatasetMetricValueModel
VideoUCF-101FVD1657FAR
Video GenerationUCF-101FVD1657FAR

Related Papers

Making Language Model a Hierarchical Classifier and Generator2025-07-17World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving2025-07-17Leveraging Pre-Trained Visual Models for AI-Generated Video Detection2025-07-17Taming Diffusion Transformer for Real-Time Mobile Video Generation2025-07-17LoViC: Efficient Long Video Generation with Context Compression2025-07-17Mitigating Object Hallucinations via Sentence-Level Early Intervention2025-07-16The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs2025-07-15Seq vs Seq: An Open Suite of Paired Encoders and Decoders2025-07-15