MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai

2024-04-28Self-Supervised Learning Multimodal Emotion Recognition Video Emotion Recognition Emotion Recognition

Abstract

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named as the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). The MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across visual and audio modalities. By utilizing a pre-trained masked autoencoder model, the MultiMAEDER is accomplished through simple, straightforward finetuning. The performance of the MultiMAE-DER is enhanced by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. In comparison to state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER enhances the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on the CREMAD. Furthermore, when compared with the state-of-the-art model of multimodal self-supervised learning, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.

Results

Task	Dataset	Metric	Value	Model
Emotion Recognition	IEMOCAP-4	Weighted Recall	63.73	MultiMAE-DER
Multimodal Emotion Recognition	IEMOCAP-4	Weighted Recall	63.73	MultiMAE-DER

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation2025-07-21 A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys2025-07-17 Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context2025-07-17 A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition2025-07-15 Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder2025-07-14 Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation2025-07-11 Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis2025-07-08 CAST-Phys: Contactless Affective States Through Physiological signals Database2025-07-08