Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, Xinglong Wu
Foundation models for video generation continue to advance and are being adapted to a growing range of applications, yet subject-consistent video generation remains at an exploratory stage. We refer to this task as Subject-to-Video: it extracts subject elements from reference images and generates subject-consistent video guided by textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, so that the generated video aligns with both the textual and the visual content simultaneously. To this end, we propose Phantom, a unified video generation framework for both single-subject and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and train it to learn cross-modal alignment from text-image-video triplet data. In particular, we emphasize subject consistency in human generation, which covers existing ID-preserving video generation while offering enhanced advantages. The project homepage is https://phantom-video.github.io/Phantom/.
Results on OpenS2V-Eval. The same values are listed under four task labels: "Video", "Video Generation", "1 Image, 2*2 Stitchi", and "Image to Video Generation".

| Dataset | Metric | Phantom-Wan-14B | Phantom-Wan-1.3B |
|---|---|---|---|
| OpenS2V-Eval | Aesthetics | 0.4639 | 0.4667 |
| OpenS2V-Eval | FaceSim | 0.5148 | 0.4855 |
| OpenS2V-Eval | GmeScore | 0.7065 | 0.6942 |
| OpenS2V-Eval | Motion | 0.3342 | 0.1429 |
| OpenS2V-Eval | NaturalScore | 0.6866 | 0.7026 |
| OpenS2V-Eval | NexusScore | 0.3743 | 0.4244 |
| OpenS2V-Eval | Total Score | 0.5232 | 0.5071 |
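
To make the comparison between the two checkpoints easier to read, the per-metric gaps can be computed directly from the table. The following is a minimal sketch that uses only the values reported above; the names `SCORES` and `metric_gaps` are hypothetical helpers introduced for illustration, and the sketch makes no assumption about how OpenS2V-Eval aggregates the individual metrics into the Total Score.

```python
# Scores copied from the OpenS2V-Eval results table above.
# SCORES and metric_gaps are illustrative names, not part of any released code.
SCORES = {
    "Phantom-Wan-14B": {
        "Aesthetics": 0.4639,
        "FaceSim": 0.5148,
        "GmeScore": 0.7065,
        "Motion": 0.3342,
        "NaturalScore": 0.6866,
        "NexusScore": 0.3743,
        "Total Score": 0.5232,
    },
    "Phantom-Wan-1.3B": {
        "Aesthetics": 0.4667,
        "FaceSim": 0.4855,
        "GmeScore": 0.6942,
        "Motion": 0.1429,
        "NaturalScore": 0.7026,
        "NexusScore": 0.4244,
        "Total Score": 0.5071,
    },
}


def metric_gaps(scores, a, b):
    """Return {metric: scores[a][metric] - scores[b][metric]}, rounded to 4 decimals."""
    return {m: round(scores[a][m] - scores[b][m], 4) for m in scores[a]}


gaps = metric_gaps(SCORES, "Phantom-Wan-14B", "Phantom-Wan-1.3B")

# A positive gap means the 14B checkpoint reports the higher value.
for metric, gap in gaps.items():
    print(f"{metric:<13} {gap:+.4f}")
```

With these values, the largest gap is in Motion (0.3342 versus 0.1429), while the 1.3B checkpoint reports higher values for Aesthetics, NaturalScore, and NexusScore.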