Speaker Normalization for Self-supervised Speech Emotion Recognition

Itai Gat, Hagai Aronowitz, Weizhong Zhu, Edmilson Morais, Ron Hoory

Abstract

Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploiting those biases and finding shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
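The abstract describes adversarial speaker normalization via gradient reversal: a shared encoder feeds an emotion classifier and a speaker classifier, and the speaker classifier's gradient is negated before reaching the encoder, so the shared features are pushed to become uninformative about speaker identity. Below is a minimal PyTorch sketch of this general idea; the network sizes, the `lam` reversal weight, and the pooled-embedding input are illustrative assumptions, not the paper's actual architecture or configuration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows into the shared encoder; None is for the lam argument.
        return -ctx.lam * grad_output, None

class SpeakerNormalizedSER(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, n_emotions=4, n_speakers=10, lam=1.0):
        super().__init__()
        self.lam = lam
        # Shared encoder on top of self-supervised speech features
        # (e.g., frame embeddings pooled into one utterance vector).
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)
        # Adversarial speaker head: through the reversal layer, minimizing its
        # loss removes speaker information from the shared representation.
        self.speaker_head = nn.Linear(hidden, n_speakers)

    def forward(self, x):
        z = self.encoder(x)
        emotion_logits = self.emotion_head(z)
        speaker_logits = self.speaker_head(GradReverse.apply(z, self.lam))
        return emotion_logits, speaker_logits

# Usage sketch: optimize the sum of both losses with a single optimizer;
# the reversal layer turns the speaker loss into normalization pressure.
model = SpeakerNormalizedSER()
x = torch.randn(8, 768)                       # batch of pooled utterance embeddings
emo_logits, spk_logits = model(x)
emo_y = torch.randint(0, 4, (8,))
spk_y = torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(emo_logits, emo_y) \
     + nn.functional.cross_entropy(spk_logits, spk_y)
loss.backward()
```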

Results

Task                        Dataset  Metric  Value  Model
Emotion Recognition         IEMOCAP  WA      0.81   TAP
Emotion Recognition         IEMOCAP  WA CV   0.742  TAP
Speech Emotion Recognition  IEMOCAP  WA      0.81   TAP
Speech Emotion Recognition  IEMOCAP  WA CV   0.742  TAP