Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Marah Halawa, Florian Blume, Pia Bideau, Martin Maier, Rasha Abdel Rahman, Olaf Hellwich

Published: 2024-04-16
Tasks: Emotion Recognition in Conversation · Emotion Classification · Self-Supervised Learning · Facial Expression Recognition
Links: Paper · PDF · Code (official)

Abstract

Human communication is multi-modal; face-to-face interaction, for example, involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of the input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks, examining how different combinations of self-supervised tasks affect performance on the downstream facial expression recognition task. Our model, ConCluGen, outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotation required. We publicly release our pre-trained models as well as the source code.
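The three objectives named in the abstract (contrastive, clustering, reconstruction) are typically combined as a weighted sum over per-modality embeddings. Below is a minimal PyTorch sketch of such a combination; the function names, the prototype-based clustering formulation, the MSE reconstruction term, and the loss weights are illustrative assumptions, not the authors' released ConCluGen implementation.

```python
# Minimal sketch of combining the three self-supervised objectives from the
# abstract. Names, weights, and formulations are illustrative assumptions,
# not the authors' released ConCluGen code.
import torch
import torch.nn.functional as F


def contrastive_loss(z_a, z_b, temperature=0.07):
    """InfoNCE-style loss that pulls two modalities of the same video together."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cross-modal similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


def clustering_loss(z, prototypes, assignments):
    """Cross-entropy against (assumed precomputed) cluster assignments, so the
    representation space keeps the semantic structure of the input data."""
    logits = F.normalize(z, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return F.cross_entropy(logits, assignments)


def total_loss(z_video, z_audio, recon, target, prototypes,
               assign_video, assign_audio, w_con=1.0, w_clu=1.0, w_rec=1.0):
    """Weighted sum of the three objectives; the weights are assumed hyperparameters."""
    l_con = contrastive_loss(z_video, z_audio)
    l_clu = clustering_loss(z_video, prototypes, assign_video) \
          + clustering_loss(z_audio, prototypes, assign_audio)
    l_rec = F.mse_loss(recon, target)  # reconstruction of the input features
    return w_con * l_con + w_clu * l_clu + w_rec * l_rec
```

In a full pipeline, the cluster assignments would come from an online step such as k-means or Sinkhorn-Knopp over the current embeddings; here they are assumed to be given as precomputed index tensors.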

Results

Task | Dataset | Metric | Value | Model
Emotion Recognition | MELD | Accuracy | 60.03 | ConCluGen
Emotion Recognition | MELD | Balanced Accuracy | 60.03 | ConCluGen
Text Classification | CMU-MOSEI | Accuracy | 66.48 | ConCluGen
Text Classification | CMU-MOSEI | Weighted Accuracy | 66.48 | ConCluGen
Emotion Classification | CMU-MOSEI | Accuracy | 66.48 | ConCluGen
Emotion Classification | CMU-MOSEI | Weighted Accuracy | 66.48 | ConCluGen
Classification | CMU-MOSEI | Accuracy | 66.48 | ConCluGen
Classification | CMU-MOSEI | Weighted Accuracy | 66.48 | ConCluGen
Facial Expression Recognition | MELD | Weighted Accuracy | 60.03 | ConCluGen
Facial Expression Recognition | CMU-MOSEI | Weighted Accuracy | 66.48 | ConCluGen

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation (2025-07-21)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder (2025-07-14)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model (2025-07-01)
ShapeEmbed: a self-supervised learning framework for 2D contour quantification (2025-07-01)