
Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan

Published: 2024-11-26 · CVPR 2025
Tasks: Text-to-Video Generation · Open-Domain Subject-to-Video · Human-Domain Subject-to-Video · Image to Video Generation · Video Generation

Paper | PDF | Code (official)

Abstract

Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into shallow layers of the network to alleviate the training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We propose a hierarchical training strategy that leverages frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V. Code: https://github.com/PKU-YuanGroup/ConsisID.
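
The two-branch control scheme is easiest to see in code. The sketch below is a minimal, hypothetical reconstruction of the idea, not the official ConsisID implementation: it splits a reference face into low- and high-frequency components (here with a simple Gaussian low-pass, an assumption; the paper's exact decomposition may differ) and encodes each with a toy module standing in for the global and local facial extractors. All module names, shapes, and hyperparameters are illustrative.

```python
# Minimal sketch of the frequency-decomposed identity control described in the
# abstract. This is NOT the official ConsisID code: the Gaussian low-pass
# split, module names, and shapes are all illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def frequency_split(face: torch.Tensor, kernel: int = 9, sigma: float = 3.0):
    """Split an image batch (B, C, H, W) into a low-frequency (blurred)
    component and a high-frequency residual via a depthwise Gaussian blur."""
    coords = torch.arange(kernel, dtype=face.dtype, device=face.device)
    coords = coords - (kernel - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    k2d = torch.outer(g, g).expand(face.shape[1], 1, kernel, kernel).contiguous()
    low = F.conv2d(face, k2d, padding=kernel // 2, groups=face.shape[1])
    return low, face - low  # residual keeps edges and fine identity detail


class GlobalFacialExtractor(nn.Module):
    """Stand-in for the low-frequency branch: encodes the blurred reference
    face (the paper also uses facial key points) into tokens that would be
    fused into the *shallow* layers of the video DiT."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, dim, 4, stride=4),
        )

    def forward(self, low_freq_face: torch.Tensor) -> torch.Tensor:
        feat = self.net(low_freq_face)          # (B, dim, h, w)
        return feat.flatten(2).transpose(1, 2)  # (B, h*w, dim) tokens


class LocalFacialExtractor(nn.Module):
    """Stand-in for the high-frequency branch: turns the residual into a few
    identity tokens that the DiT's transformer blocks could attend to."""

    def __init__(self, dim: int = 256, n_tokens: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, 8, stride=8)
        self.pool = nn.AdaptiveAvgPool1d(n_tokens)

    def forward(self, high_freq_face: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(high_freq_face).flatten(2)  # (B, dim, h*w)
        return self.pool(tokens).transpose(1, 2)       # (B, n_tokens, dim)


if __name__ == "__main__":
    face = torch.randn(1, 3, 224, 224)           # dummy reference image
    low, high = frequency_split(face)
    print(GlobalFacialExtractor()(low).shape)    # torch.Size([1, 196, 256])
    print(LocalFacialExtractor()(high).shape)    # torch.Size([1, 16, 256])
```

In the paper's framing, the low-frequency features ease DiT training when injected into shallow layers, while the high-frequency tokens injected into transformer blocks preserve fine-grained identity detail; the hierarchical training strategy then turns a vanilla pre-trained video generator into an IPT2V model.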

Results

All results are on the OpenS2V-Eval benchmark. The same scores are reported under four task listings (Video; Video Generation; 1 Image, 2*2 Stitching; Image to Video Generation), so they are shown once below.

Model     | Aesthetics | FaceSim | GmeScore | Motion | NaturalScore | NexusScore | Total Score
----------|------------|---------|----------|--------|--------------|------------|------------
Kling 1.6 | 0.446      | 0.401   | 0.662    | 0.416  | 0.7906       | 0.4592     | 0.5446
Pika 2.1  | 0.4687     | 0.308   | 0.6921   | 0.247  | 0.6979       | 0.4541     | 0.4888
Vidu 2.0  | 0.4147     | 0.3511  | 0.6757   | 0.1352 | 0.7144       | 0.4355     | 0.4759
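
For readers who want to work with these numbers directly, the snippet below (assuming Python with pandas installed; the DataFrame simply transcribes the table above) ranks the three models by the benchmark's aggregate Total Score:

```python
# Load the OpenS2V-Eval scores from the table above and rank the models.
# Values are transcribed verbatim; only the ranking logic is new.
import pandas as pd

results = pd.DataFrame(
    {
        "Aesthetics":   [0.446,  0.4687, 0.4147],
        "FaceSim":      [0.401,  0.308,  0.3511],
        "GmeScore":     [0.662,  0.6921, 0.6757],
        "Motion":       [0.416,  0.247,  0.1352],
        "NaturalScore": [0.7906, 0.6979, 0.7144],
        "NexusScore":   [0.4592, 0.4541, 0.4355],
        "Total Score":  [0.5446, 0.4888, 0.4759],
    },
    index=["Kling 1.6", "Pika 2.1", "Vidu 2.0"],
)

# Sort by the benchmark's aggregate metric; Kling 1.6 comes out first.
print(results.sort_values("Total Score", ascending=False))
```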

Related Papers

LoViC: Efficient Long Video Generation with Context Compression (2025-07-17)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Leveraging Pre-Trained Visual Models for AI-Generated Video Detection (2025-07-17)
Taming Diffusion Transformer for Real-Time Mobile Video Generation (2025-07-17)
$I^{2}$-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting (2025-07-12)
Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective (2025-07-11)
Scaling RL to Long Videos (2025-07-10)
Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions (2025-07-10)