Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MAGREF: Masked Guidance for Any-Reference Video Generation

Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma

2025-05-29 · Single-Domain Subject-to-Video · Open-Domain Subject-to-Video · Human-Domain Subject-to-Video · Video Generation

Paper · PDF · Code (official)

Abstract

Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle inference over various subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
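The two mechanisms named in the abstract — region-aware masking of reference subjects and pixel-wise channel concatenation — can be illustrated with a minimal sketch. All shapes, names, and the merging scheme below are assumptions for illustration only; the actual MAGREF implementation is in the linked repository.

```python
import numpy as np

def masked_channel_condition(video_latent, ref_images, ref_masks):
    """Illustrative sketch (not the official MAGREF code): condition a
    video latent on reference images by masking each reference to its
    region and concatenating the merged map along the channel axis.

    video_latent: (T, C, H, W) latent for T frames
    ref_images:   list of (C, H, W) reference feature maps
    ref_masks:    list of (H, W) binary region masks, one per reference
    """
    # Region-aware masking (assumed form): zero out everything outside
    # each subject's region, then merge all references into one map.
    cond = np.zeros_like(ref_images[0])
    for img, mask in zip(ref_images, ref_masks):
        cond += img * mask[None, :, :]  # broadcast mask over channels

    # Pixel-wise channel concatenation: tile the conditioning map across
    # time and stack it onto the latent's channel dimension.
    cond_t = np.broadcast_to(cond, (video_latent.shape[0],) + cond.shape)
    return np.concatenate([video_latent, cond_t], axis=1)  # (T, 2C, H, W)

# Toy example: two subjects occupying disjoint regions of a 16x16 latent.
latent = np.random.randn(4, 8, 16, 16)
refs = [np.random.randn(8, 16, 16) for _ in range(2)]
masks = [np.zeros((16, 16)), np.zeros((16, 16))]
masks[0][:8, :] = 1.0   # subject 1: top half
masks[1][8:, :] = 1.0   # subject 2: bottom half
out = masked_channel_condition(latent, refs, masks)
print(out.shape)  # (4, 16, 16, 16)
```

Concatenating on the channel axis (rather than, say, adding tokens along the sequence axis) keeps the conditioning spatially aligned with the latent pixel by pixel, which is the property the abstract credits for preserving appearance features.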

Results

All results are for MAGREF-480P on the OpenS2V-Eval dataset. Identical scores are listed under four task categories: Video, Video Generation, "1 Image, 2*2 Stitchi", and Image to Video Generation.

| Metric | Value |
|---|---|
| Aesthetics | 0.4502 |
| FaceSim | 0.3083 |
| GmeScore | 0.7047 |
| Motion | 0.2181 |
| NaturalScore | 0.6949 |
| NexusScore | 0.4304 |
| Total Score | 0.4793 |

Related Papers

- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
- Scaling RL to Long Videos (2025-07-10)
- Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)