Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

Kazi Injamamul Haque, Zerrin Yumak

2023-03-09 · Representation Learning · 3D Face Animation

Paper · PDF · Code (official)

Abstract

This paper presents FaceXHuBERT, a text-less, speech-driven 3D facial animation generation method that captures personalized and subtle cues in speech (e.g. identity, emotion and hesitation). It is also robust to background noise and can handle audio recorded in a variety of situations (e.g. multiple people speaking). Recent approaches employ end-to-end deep learning that takes both audio and text as input to generate facial animation for the whole face. However, the scarcity of publicly available expressive audio-3D facial animation datasets poses a major bottleneck, and the resulting animations still have issues with accurate lip-syncing, expressivity, person-specific information and generalizability. We effectively employ a self-supervised pretrained HuBERT model in the training process, which allows us to incorporate both lexical and non-lexical information in the audio without using a large lexicon. Additionally, guiding the training with a binary emotion condition and speaker identity helps distinguish even the subtlest facial motion. We carried out extensive objective and subjective evaluation against ground truth and state-of-the-art work. A perceptual user study demonstrates that our approach produces results that are superior with respect to the realism of the animation 78% of the time compared to the state of the art. In addition, our method is 4 times faster, as it eliminates complex sequential models such as transformers. We strongly recommend watching the supplementary video before reading the paper. We also provide the implementation and evaluation code in a GitHub repository.
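The abstract describes conditioning training on a binary emotion flag and the speaker's identity alongside the audio features. A minimal sketch of one common way to realize such conditioning — broadcasting the binary flag and a one-hot speaker vector across frames and concatenating them to frame-level speech features — is shown below. All names and shapes here are illustrative assumptions, not the official FaceXHuBERT implementation.

```python
import numpy as np

def condition_features(audio_feats, emotion, speaker_id, num_speakers):
    """Append conditioning channels to frame-level audio features.

    audio_feats: (T, D) frame-level speech features (e.g. HuBERT outputs).
    emotion: 0 (neutral) or 1 (emotional) -- the binary emotion condition.
    speaker_id: integer in [0, num_speakers) -- encoded one-hot.
    Returns an array of shape (T, D + 1 + num_speakers).
    """
    T = audio_feats.shape[0]
    emo = np.full((T, 1), float(emotion))       # binary emotion channel, repeated per frame
    spk = np.zeros((T, num_speakers))
    spk[:, speaker_id] = 1.0                    # one-hot speaker identity, repeated per frame
    return np.concatenate([audio_feats, emo, spk], axis=1)

feats = np.zeros((4, 8))                        # 4 frames of 8-dim features (toy values)
out = condition_features(feats, emotion=1, speaker_id=2, num_speakers=6)
print(out.shape)  # (4, 15)
```

Concatenation per frame keeps the conditioning signal available to every layer that consumes the audio features, which is one plausible reason the paper reports it helps separate subtle, person-specific motion.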

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| 3D Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 3D Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| 3D | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 3D | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| 3D Face Animation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 3D Face Animation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| 2D Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 2D Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| 3D Absolute Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 3D Absolute Human Pose Estimation | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
| 1 Image, 2*2 Stitchi | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | FDD | 4.96 | FaceXHuBERT |
| 1 Image, 2*2 Stitchi | Biwi 3D Audiovisual Corpus of Affective Communication - B3D(AC)^2 | Lip Vertex Error | 4.56 | FaceXHuBERT |
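The table above reports Lip Vertex Error (LVE), a metric commonly used in speech-driven facial animation: for each frame, take the maximal L2 error over the lip vertices, then average across frames. A hedged sketch of that definition (the vertex layout, lip indices, and units are illustrative assumptions, not this paper's exact evaluation code) could look like:

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    """Lip Vertex Error as commonly defined: per-frame maximal L2 error
    over the lip vertices, averaged over all frames.

    pred, gt: (T, V, 3) vertex sequences (T frames, V vertices).
    lip_idx: indices of the lip vertices within the mesh.
    """
    # Per-vertex L2 errors restricted to the lip region: shape (T, len(lip_idx))
    err = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)
    # Max over lip vertices per frame, then mean over frames
    return err.max(axis=1).mean()

# Toy check: identical sequences give zero error.
rng = np.random.default_rng(0)
gt = rng.random((5, 10, 3))
print(lip_vertex_error(gt, gt, [0, 1, 2]))  # → 0.0
```

Lower is better; the metric deliberately penalizes the worst lip vertex in each frame, so a model cannot hide a bad mouth shape behind otherwise accurate vertices.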

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction (2025-07-15)
Dual Dimensions Geometric Representation Learning Based Document Dewarping (2025-07-11)