Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Uttaran Bhattacharya, Elizabeth Childs, Nicholas Rewkowski, Dinesh Manocha

2021-07-31 · Gesture Generation
Paper · PDF · Code (official)

Abstract

We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator that synthesizes gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator that distinguishes between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients (MFCCs) and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces that the synthesized gestures contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from speech: the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10–33%, the mean acceleration difference by 8–58%, and the Fréchet Gesture Distance (FGD) by 21–34%. We also conduct a user study and observe that, compared to the best current baselines, around 15.28% of participants indicated our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech.
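To ground the speech input the abstract describes, here is a minimal sketch of MFCC extraction for the audio encoder, using librosa. The sampling rate and coefficient count below are illustrative assumptions, not the paper's configuration.

    import librosa

    # Hypothetical preprocessing sketch: the paper feeds MFCCs computed from the
    # input speech into a dedicated audio encoder. The sampling rate and n_mfcc
    # values here are assumptions, not the paper's exact settings.
    def speech_to_mfcc(wav_path, sr=16000, n_mfcc=13):
        y, _ = librosa.load(wav_path, sr=sr)                     # load and resample audio
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, T)
        return mfcc.T                                            # (T, n_mfcc): one vector per frame

The affective encoder transforms 3D pose sequences into latent affective features with multi-scale spatial-temporal graph convolutions. A single spatial-temporal graph-convolution block in that spirit might look like the following PyTorch sketch; the multi-scale stacking, adjacency normalization, and layer widths are assumptions.

    import torch
    import torch.nn as nn

    class STGraphConvBlock(nn.Module):
        """One spatial-temporal graph convolution over a pose sequence, in the
        spirit of ST-GCN. Input x: (N, C, T, V) = (batch, channels, frames,
        joints); adj: a (V, V) normalized skeleton adjacency matrix. The
        paper's multi-scale variant is not captured by this single block."""
        def __init__(self, in_ch, out_ch, adj, t_kernel=9):
            super().__init__()
            self.register_buffer("adj", adj)                     # fixed skeleton adjacency
            self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # per-joint channel mixing
            self.temporal = nn.Conv2d(out_ch, out_ch,
                                      kernel_size=(t_kernel, 1),
                                      padding=(t_kernel // 2, 0))    # convolve along time
            self.relu = nn.ReLU()

        def forward(self, x):
            x = self.spatial(x)                                  # (N, out_ch, T, V)
            x = torch.einsum("nctv,vw->nctw", x, self.adj)       # aggregate neighboring joints
            return self.relu(self.temporal(x))

Stacking such blocks over several skeleton graph resolutions and pooling over joints and time would yield the latent, pose-based affective feature vector that both the generator and the discriminator consume.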

Results

Task                | Dataset             | Metric | Value | Model
3D                  | TED Gesture Dataset | FGD    | 3.54  | Speech2AffectiveGestures
3D Shape Generation | TED Gesture Dataset | FGD    | 3.54  | Speech2AffectiveGestures
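The FGD values above are Fréchet distances between Gaussian fits to feature embeddings of real and synthesized gesture sequences, computed with the same closed form as FID. A minimal NumPy/SciPy sketch, assuming the embeddings come from some pretrained gesture feature extractor (the extractor itself is not shown):

    import numpy as np
    from scipy import linalg

    def frechet_gesture_distance(feats_real, feats_fake):
        """Fréchet distance between two Gaussian fits, as in FID/FGD.
        feats_*: (num_sequences, feature_dim) embeddings from a pretrained
        gesture feature extractor (the extractor is assumed here)."""
        mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)     # matrix square root
        if np.iscomplexobj(covmean):
            covmean = covmean.real                               # drop numerical noise
        diff = mu_r - mu_f
        return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

Lower is better: the 21–34% FGD improvement reported in the abstract is a reduction in this distance relative to the best baselines.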

Related Papers

DeepGesture: A conversational gesture synthesis system based on emotions and semantics (2025-07-03)
Intentional Gesture: Deliver Your Intentions with Gestures for Speech (2025-05-21)
M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis (2025-05-13)
Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication (2025-05-08)
Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion (2025-05-03)
EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation (2025-04-12)
EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model (2025-04-11)
Audio-driven Gesture Generation via Deviation Feature in the Latent Space (2025-03-27)