CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation

Caroline Etienne, Guillaume Fidanza, Andrei Petrovskii, Laurence Devillers, Benoit Schmauch

2018-02-15 | Data Augmentation | Speech Emotion Recognition | Emotion Recognition

Abstract

In this work we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones, for aggregating long-term dependencies. We examine data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% weighted accuracy and 61.7% unweighted accuracy on four emotions.
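The vocal tract length perturbation mentioned in the abstract can be illustrated with a minimal sketch. It uses the piecewise-linear frequency warping commonly associated with this technique; the abstract does not give the authors' exact settings, so the function names, the boundary frequency `f_hi=4800.0`, the 16 kHz sample rate, and the warp factor range are assumptions made for illustration only.

```python
import numpy as np

def vtlp_warp_frequencies(freqs, alpha, sample_rate, f_hi=4800.0):
    """Piecewise-linear warp of frequencies (Hz) by factor alpha.

    f_hi is an assumed boundary frequency, not a value from the abstract.
    The warp maps 0 -> 0 and Nyquist -> Nyquist and is continuous in between.
    """
    freqs = np.asarray(freqs, dtype=float)
    nyquist = sample_rate / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    lower = freqs * alpha
    upper = nyquist - (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary) * (nyquist - freqs)
    return np.where(freqs <= boundary, lower, upper)

def vtlp_spectrogram(spec, alpha, sample_rate, f_hi=4800.0):
    """Apply the warp to a linear-frequency spectrogram of shape (n_bins, n_frames)."""
    spec = np.asarray(spec, dtype=float)
    n_bins = spec.shape[0]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    warped = vtlp_warp_frequencies(freqs, alpha, sample_rate, f_hi)
    out = np.empty_like(spec)
    for t in range(spec.shape[1]):
        # Energy originally at freqs[i] moves to warped[i]; resample onto the fixed grid.
        out[:, t] = np.interp(freqs, warped, spec[:, t])
    return out

# Hypothetical usage: draw one warp factor per utterance (range is an assumption).
rng = np.random.default_rng(0)
spec = rng.random((129, 50))          # stand-in for a real magnitude spectrogram
alpha = rng.uniform(0.9, 1.1)
augmented = vtlp_spectrogram(spec, alpha, sample_rate=16000)
```

Each warped copy keeps the original emotion label, so drawing a fresh `alpha` per utterance (or per epoch) is one plausible way to enlarge the training set.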

Results

Task	Dataset	Metric	Value	Model
Emotion Recognition	IEMOCAP	UA	0.65	CNN+LSTM
Speech Emotion Recognition	IEMOCAP	UA	0.65	CNN+LSTM
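
The listing does not define its metrics. Under the convention often used for IEMOCAP, weighted accuracy (WA) is the overall fraction of correctly classified utterances, and unweighted accuracy (UA) is the mean of the per-class recalls; some papers use the names the other way round. The sketch below assumes the first convention, and its function names and toy labels are hypothetical.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: every utterance counts equally (assumed WA convention)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every class counts equally (assumed UA convention)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy, made-up labels with class imbalance (four of class 0, two of class 1).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0 and 0.5
```

On imbalanced data the two numbers diverge, which is why both are usually reported together.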
