
Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu

Published: 2025-02-16
Tasks: Single-Domain Subject-to-Video · Open-Domain Subject-to-Video · Human-Domain Subject-to-Video · cross-modal alignment · Video Generation
Links: Paper · PDF · Code

Abstract

Foundation models for video generation are developing rapidly and branching into diverse applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating subject-consistent videos from textual instructions. We believe the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby aligning deeply and simultaneously with both textual and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and train it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.
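To make the "joint text-image injection" idea concrete, here is a minimal PyTorch sketch of one way such a block could look: text tokens and subject-reference image tokens are concatenated into a single dual-modal conditioning sequence that the video latents cross-attend to. All module names, dimensions, and the random stand-in features are illustrative assumptions, not the authors' implementation.

```python
# Sketch of dual-modal (text + reference image) conditioning for video latents.
# Hypothetical structure; not the Phantom codebase.
import torch
import torch.nn as nn

class JointTextImageInjection(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # Video latents query the fused text+image conditioning sequence.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, image_tokens):
        # video_tokens: (B, T*H*W, D) flattened spatiotemporal latents
        # text_tokens:  (B, L_text, D) from a text encoder
        # image_tokens: (B, L_img, D)  from a vision encoder applied to the
        #               subject reference image(s)
        cond = torch.cat([text_tokens, image_tokens], dim=1)  # dual-modal prompt
        attended, _ = self.cross_attn(query=self.norm(video_tokens),
                                      key=cond, value=cond)
        return video_tokens + attended  # residual injection

# Smoke test with random features standing in for real encoder outputs.
block = JointTextImageInjection()
video = torch.randn(2, 4 * 16 * 16, 512)  # 4 frames of 16x16 latents
text = torch.randn(2, 77, 512)            # e.g. a CLIP-style text length
image = torch.randn(2, 257, 512)          # e.g. ViT patch tokens + [CLS]
print(block(video, text, image).shape)    # torch.Size([2, 1024, 512])
```

Concatenating both modalities into one key/value sequence lets a single attention pass balance textual instructions against subject appearance, which matches the paper's framing of subject-to-video as a dual-modal prompting problem.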

Results

All results are on the OpenS2V-Eval benchmark. The page reports identical scores under four leaderboard tasks (Video, Video Generation, "1 Image, 2*2 Stitching", and Image to Video Generation), so they are consolidated into one table below.

| Metric | Phantom-Wan-14B | Phantom-Wan-1.3B |
| --- | --- | --- |
| Aesthetics | 0.4639 | 0.4667 |
| FaceSim | 0.5148 | 0.4855 |
| GmeScore | 0.7065 | 0.6942 |
| Motion | 0.3342 | 0.1429 |
| NaturalScore | 0.6866 | 0.7026 |
| NexusScore | 0.3743 | 0.4244 |
| Total Score | 0.5232 | 0.5071 |
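For convenience, the reported scores can be dropped into a small script to compare the two model sizes metric by metric. The values below are copied verbatim from the table above; only the script structure is new.

```python
# Reported OpenS2V-Eval scores, keyed by metric then model.
scores = {
    "Aesthetics":   {"Phantom-Wan-14B": 0.4639, "Phantom-Wan-1.3B": 0.4667},
    "FaceSim":      {"Phantom-Wan-14B": 0.5148, "Phantom-Wan-1.3B": 0.4855},
    "GmeScore":     {"Phantom-Wan-14B": 0.7065, "Phantom-Wan-1.3B": 0.6942},
    "Motion":       {"Phantom-Wan-14B": 0.3342, "Phantom-Wan-1.3B": 0.1429},
    "NaturalScore": {"Phantom-Wan-14B": 0.6866, "Phantom-Wan-1.3B": 0.7026},
    "NexusScore":   {"Phantom-Wan-14B": 0.3743, "Phantom-Wan-1.3B": 0.4244},
    "Total Score":  {"Phantom-Wan-14B": 0.5232, "Phantom-Wan-1.3B": 0.5071},
}

# Print each metric with the 14B-minus-1.3B difference.
for metric, by_model in scores.items():
    delta = by_model["Phantom-Wan-14B"] - by_model["Phantom-Wan-1.3B"]
    print(f"{metric:>12}: 14B {by_model['Phantom-Wan-14B']:.4f} "
          f"vs 1.3B {by_model['Phantom-Wan-1.3B']:.4f} (delta {delta:+.4f})")
```

The comparison shows the 14B model ahead mainly on FaceSim, Motion, and Total Score, while the 1.3B model scores slightly higher on Aesthetics, NaturalScore, and NexusScore.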

Related Papers

- Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- CATVis: Context-Aware Thought Visualization (2025-07-15)
- Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)