FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura

2021-12-10 | CVPR 2022 | 3D Face Animation

Abstract

Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, which occasionally results in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention, and biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter enables generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
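
As a rough illustration of the periodic positional encoding idea mentioned in the abstract, the sketch below applies the standard sinusoidal Transformer encoding to the frame index taken modulo a period, so the encoding repeats instead of growing with sequence length. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `periodic_positional_encoding`, its parameters, and the period value used in the example are all hypothetical.

```python
import math


def periodic_positional_encoding(num_frames, d_model, period):
    """Return a num_frames x d_model table of sinusoidal encodings.

    Assumption: the position fed to the usual sin/cos formula is
    (t mod period), so frames t and t + period share one encoding.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    if period <= 0:
        raise ValueError("period must be positive")

    table = []
    for t in range(num_frames):
        pos = t % period  # wrap the frame index into [0, period)
        row = []
        # i walks over the even channel indices 0, 2, 4, ...
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even channel
            row.append(math.cos(angle))  # odd channel
        table.append(row)
    return table


# Example with a hypothetical period of 25 frames.
pe = periodic_positional_encoding(num_frames=60, d_model=8, period=25)
print(pe[0] == pe[25])  # frames one period apart share an encoding
```

Because the encoding depends only on the wrapped index, a sequence longer than any seen during training reuses encodings that were already seen, which is one plausible reading of why the abstract links this strategy to generalization over longer audio.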

Results

Task	Dataset	Metric	Value	Model
3D Face Animation	Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2	FDD	4.6408	FaceFormer
3D Face Animation	Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2	Lip Vertex Error	5.3077	FaceFormer
3D Face Animation	VOCASET	Lip Vertex Error	5.3742	FaceFormer
3D Face Animation	BEAT2	MSE	7.787	FaceFormer
