Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features from short audio windows with limited context, which occasionally results in inaccurate lip movements. To tackle this limitation, we propose FaceFormer, a Transformer-based autoregressive model that encodes long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with data scarcity, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: biased cross-modal multi-head (MH) attention, and biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter enables generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
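To make the periodic positional encoding and the biased causal self-attention more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the `period` and `slope` parameters, and the exact form of the bias (a penalty that grows by one step for every full period between query frame `i` and key frame `j`, with future frames masked out) are assumptions made for illustration.

```python
import math


def periodic_positional_encoding(seq_len, d_model, period):
    """Sinusoidal encoding indexed by (t mod period).

    Because the position index wraps around, the table repeats every
    `period` frames, so sequences longer than those seen in training
    never produce an unseen position value.
    """
    table = []
    for t in range(seq_len):
        pos = t % period  # wrap the frame index into [0, period)
        row = []
        for k in range(d_model):
            # Channels 2i and 2i+1 share one frequency (standard sinusoidal form).
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def periodic_causal_bias(seq_len, period, slope=1.0):
    """Additive attention bias (assumed form, for illustration only).

    bias[i][j] = -slope * floor((i - j) / period)  if j <= i
               = -inf                              if j >  i  (future frame)
    """
    bias = []
    for i in range(seq_len):
        bias.append([
            -slope * ((i - j) // period) if j <= i else float("-inf")
            for j in range(seq_len)
        ])
    return bias


pe = periodic_positional_encoding(seq_len=8, d_model=4, period=3)
bias = periodic_causal_bias(seq_len=5, period=2)
```

In this sketch the bias would be added to the attention logits before the softmax, so frames within the same period as the query are not penalized, while older periods receive progressively lower weight. In a multi-head setting, `slope` would typically differ per head.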
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| 3D Face Animation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.6408 | FaceFormer |
| 3D Face Animation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 5.3077 | FaceFormer |
| 3D Face Animation | VOCASET | Lip Vertex Error | 5.3742 | FaceFormer |
| 3D Face Animation | BEAT2 | MSE | 7.787 | FaceFormer |