
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Published: 2018-08-16
Tasks: Facial Emotion Recognition, Speech Emotion Recognition, Facial Expression Recognition (FER), Emotion Recognition

Abstract

Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
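The cross-modal distillation idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact training objective, so everything here is an assumption: a temperature-softened cross-entropy between the face teacher's predictions and the voice student's predictions, with hypothetical function names and a hypothetical temperature value. It is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,  # assumed value, not taken from the paper
) -> float:
    """Cross-entropy between softened teacher and student distributions.

    teacher_logits: emotion scores from the face network for a video frame.
    student_logits: emotion scores from the speech network for the
                    time-aligned audio segment (no audio labels needed).
    """
    p = softmax(teacher_logits, temperature)  # soft targets from faces
    q = softmax(student_logits, temperature)  # student predictions from voice
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


# Hypothetical logits over three emotion classes.
face_teacher = [2.0, 0.5, -1.0]
voice_agrees = [2.0, 0.5, -1.0]
voice_disagrees = [-1.0, 0.5, 2.0]

loss_agree = distillation_loss(face_teacher, voice_agrees)
loss_disagree = distillation_loss(face_teacher, voice_disagrees)
```

In this sketch the loss is lowest when the speech student reproduces the face teacher's distribution, which is the sense in which expression annotations are transferred from faces to voices without labelled audio.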

Results

Task | Dataset | Metric | Value | Model
Facial Recognition and Modelling | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
Face Reconstruction | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
Facial Expression Recognition (FER) | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
3D | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
3D Face Modelling | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
3D Face Reconstruction | FERPlus | Accuracy (pretrained) | 88.88 | SENet Teacher
