
Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu

Published: 2025-02-16
Tasks: Single-Domain Subject-to-Video · Open-Domain Subject-to-Video · Human-Domain Subject-to-Video · cross-modal alignment · Video Generation
Links: Paper · PDF · Code

Abstract

Foundation models for video generation are developing rapidly and branching into diverse applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: extracting subject elements from reference images and generating subject-consistent videos from textual instructions. We believe the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby aligning deeply and simultaneously with both textual and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and train it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, covering existing ID-preserving video generation while offering enhanced advantages. The project homepage is at https://phantom-video.github.io/Phantom/.
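To make the "joint text-image injection" idea concrete, here is a minimal PyTorch sketch of one way such a block could look: text tokens and subject-reference image tokens are concatenated into a single dual-modal conditioning sequence that the video latents cross-attend to. All module names, dimensions, and the random stand-in features are illustrative assumptions, not the authors' implementation.

```python
# Sketch of dual-modal (text + reference image) conditioning for video latents.
# Hypothetical structure; not the Phantom codebase.
import torch
import torch.nn as nn

class JointTextImageInjection(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # Video latents query the fused text+image conditioning sequence.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, image_tokens):
        # video_tokens: (B, T*H*W, D) flattened spatiotemporal latents
        # text_tokens:  (B, L_text, D) from a text encoder
        # image_tokens: (B, L_img, D)  from a vision encoder applied to the
        #               subject reference image(s)
        cond = torch.cat([text_tokens, image_tokens], dim=1)  # dual-modal prompt
        attended, _ = self.cross_attn(query=self.norm(video_tokens),
                                      key=cond, value=cond)
        return video_tokens + attended  # residual injection

# Smoke test with random features standing in for real encoder outputs.
block = JointTextImageInjection()
video = torch.randn(2, 4 * 16 * 16, 512)  # 4 frames of 16x16 latents
text = torch.randn(2, 77, 512)            # e.g. a CLIP-style text length
image = torch.randn(2, 257, 512)          # e.g. ViT patch tokens + [CLS]
print(block(video, text, image).shape)    # torch.Size([2, 1024, 512])
```

Concatenating both modalities into one key/value sequence lets a single attention pass balance textual instructions against subject appearance, which matches the paper's framing of subject-to-video as a dual-modal prompting problem.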

Results

All results are on the OpenS2V-Eval benchmark. The page reports identical scores under four leaderboard tasks (Video, Video Generation, "1 Image, 2*2 Stitching", and Image to Video Generation), so they are consolidated into one table below.

| Metric | Phantom-Wan-14B | Phantom-Wan-1.3B |
| --- | --- | --- |
| Aesthetics | 0.4639 | 0.4667 |
| FaceSim | 0.5148 | 0.4855 |
| GmeScore | 0.7065 | 0.6942 |
| Motion | 0.3342 | 0.1429 |
| NaturalScore | 0.6866 | 0.7026 |
| NexusScore | 0.3743 | 0.4244 |
| Total Score | 0.5232 | 0.5071 |
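For convenience, the reported scores can be dropped into a small script to compare the two model sizes metric by metric. The values below are copied verbatim from the table above; only the script structure is new.

```python
# Reported OpenS2V-Eval scores, keyed by metric then model.
scores = {
    "Aesthetics":   {"Phantom-Wan-14B": 0.4639, "Phantom-Wan-1.3B": 0.4667},
    "FaceSim":      {"Phantom-Wan-14B": 0.5148, "Phantom-Wan-1.3B": 0.4855},
    "GmeScore":     {"Phantom-Wan-14B": 0.7065, "Phantom-Wan-1.3B": 0.6942},
    "Motion":       {"Phantom-Wan-14B": 0.3342, "Phantom-Wan-1.3B": 0.1429},
    "NaturalScore": {"Phantom-Wan-14B": 0.6866, "Phantom-Wan-1.3B": 0.7026},
    "NexusScore":   {"Phantom-Wan-14B": 0.3743, "Phantom-Wan-1.3B": 0.4244},
    "Total Score":  {"Phantom-Wan-14B": 0.5232, "Phantom-Wan-1.3B": 0.5071},
}

# Print each metric with the 14B-minus-1.3B difference.
for metric, by_model in scores.items():
    delta = by_model["Phantom-Wan-14B"] - by_model["Phantom-Wan-1.3B"]
    print(f"{metric:>12}: 14B {by_model['Phantom-Wan-14B']:.4f} "
          f"vs 1.3B {by_model['Phantom-Wan-1.3B']:.4f} (delta {delta:+.4f})")
```

The comparison shows the 14B model ahead mainly on FaceSim, Motion, and Total Score, while the 1.3B model scores slightly higher on Aesthetics, NaturalScore, and NexusScore.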

Related Papers

- Transformer-based Spatial Grounding: A Comprehensive Survey (2025-07-17)
- World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
- Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
- LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
- CATVis: Context-Aware Thought Visualization (2025-07-15)
- Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
- $I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)