

HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

Soumya Dutta, Sriram Ganapathy

2023-04-14
Tasks: Emotion Recognition in Conversation, Emotion Classification, Multimodal Emotion Recognition, Emotion Recognition

Abstract

Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention that convert each utterance in a given conversation to a fixed-dimensional embedding. To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that weighs the utterance-level embeddings by their relevance to the emotion recognition task. The neural network parameters in the audio layers, text layers, and multi-modal co-attention layers are trained hierarchically for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, and show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all three datasets.
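The abstract describes the pipeline in prose only. As a rough orientation, the sketch below shows how such a hierarchy could be wired up in PyTorch: per-utterance bi-directional recurrent encoders with self-attention pooling for each modality, followed by co-attention fusion across utterances. Everything here is an illustrative assumption rather than the authors' released implementation: the class names (HCAMSketch, UtteranceEncoder, CoAttentionFusion), the hidden size, the use of nn.MultiheadAttention as a stand-in for the paper's co-attention layer, and the input feature dimensions are all hypothetical, and the paper's stage-wise hierarchical training schedule is omitted.

```python
import torch
import torch.nn as nn


class UtteranceEncoder(nn.Module):
    """Bi-directional GRU + self-attention pooling: maps a variable-length
    sequence of frame/token features to one fixed-dimensional embedding."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)

    def forward(self, x):                       # x: (utts, steps, in_dim)
        h, _ = self.rnn(x)                      # (utts, steps, 2*hid_dim)
        w = torch.softmax(self.attn(h), dim=1)  # self-attention weights over steps
        return (w * h).sum(dim=1)               # (utts, 2*hid_dim)


class CoAttentionFusion(nn.Module):
    """Co-attention over utterance embeddings: each modality queries the
    conversation-level context of the other modality."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, text):             # each: (1, utts, dim)
        a, _ = self.a2t(audio, text, text)      # audio attends to text context
        t, _ = self.t2a(text, audio, audio)     # text attends to audio context
        return torch.cat([a, t], dim=-1)        # (1, utts, 2*dim)


class HCAMSketch(nn.Module):
    """One conversation at a time; the paper's hierarchical (stage-wise)
    training of audio, text, and co-attention parameters is not shown."""

    def __init__(self, audio_dim=768, text_dim=768, hid=128, n_classes=4):
        super().__init__()
        self.audio_enc = UtteranceEncoder(audio_dim, hid)
        self.text_enc = UtteranceEncoder(text_dim, hid)
        self.fusion = CoAttentionFusion(2 * hid)
        self.clf = nn.Linear(4 * hid, n_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (utts, frames, audio_dim) pre-extracted wav2vec features
        # text_feats:  (utts, tokens, text_dim)  pre-extracted BERT features
        a = self.audio_enc(audio_feats).unsqueeze(0)  # (1, utts, 2*hid)
        t = self.text_enc(text_feats).unsqueeze(0)    # (1, utts, 2*hid)
        fused = self.fusion(a, t)                     # (1, utts, 4*hid)
        return self.clf(fused).squeeze(0)             # (utts, n_classes)


# Minimal smoke test on one hypothetical 6-utterance conversation:
model = HCAMSketch()
audio = torch.randn(6, 200, 768)   # 6 utterances, 200 wav2vec frames each
text = torch.randn(6, 40, 768)     # 6 utterances, 40 BERT tokens each
logits = model(audio, text)        # -> (6, 4) per-utterance emotion logits
```

The co-attention step mirrors the abstract's intent (weighing utterance-level embeddings across modalities and conversational context), but the exact attention formulation and the three training stages referenced by "Stage III" in the results below would follow the paper.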

Results

Task                           | Dataset   | Metric      | Value | Model
Emotion Recognition            | MELD      | Weighted F1 | 65.8  | Audio + Text (Stage III)
Emotion Recognition            | CMU-MOSI  | F1 score    | 0.858 | Audio + Text (Stage III)
Emotion Recognition            | IEMOCAP-4 | F1          | 70.5  | Audio + Text (Stage III)
Multimodal Emotion Recognition | IEMOCAP-4 | F1          | 70.5  | Audio + Text (Stage III)
Multimodal Emotion Recognition | MELD      | Weighted F1 | 65.8  | Audio + Text (Stage III)

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation (2025-07-21)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context (2025-07-17)
A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition (2025-07-15)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
CAST-Phys: Contactless Affective States Through Physiological signals Database (2025-07-08)
Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset, Experiment, and Analysis (2025-07-06)
How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models? (2025-06-25)