
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

Yingzhi Wang, Abdelmoumene Boumadane, Abdelwahab Heba

2021-11-04

Tasks: Speech Recognition, Automatic Speech Recognition (ASR), Speaker Verification, Slot Filling, Spoken Language Understanding, Speech Emotion Recognition, Intent Classification, Emotion Recognition

Abstract

Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not yet been fully shown to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reached 79.58% weighted accuracy in the speaker-dependent setting and 73.01% weighted accuracy in the speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, and 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, showing the strength of fine-tuned wav2vec 2.0 and HuBERT in learning prosodic, voice-print and semantic representations.

Results

Task | Dataset | Metric | Value | Model
Speaker Verification | VoxCeleb1 | EER | 2.36 | Fine-tuned HuBERT Large
Emotion Recognition | IEMOCAP | WA | 0.796 | Partially Fine-tuned HuBERT Large
Emotion Recognition | IEMOCAP | WA CV | 0.73 | Partially Fine-tuned HuBERT Large
Intent Classification | SLURP | Accuracy (%) | 87.51 | Partially Fine-tuned HuBERT
Slot Filling | SLURP | F1 | 0.753 | Partially Fine-tuned HuBERT
Speech Emotion Recognition | IEMOCAP | WA | 0.796 | Partially Fine-tuned HuBERT Large
Speech Emotion Recognition | IEMOCAP | WA CV | 0.73 | Partially Fine-tuned HuBERT Large
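
This page does not define the WA and EER metrics reported above. As a rough illustration only, the Python sketch below uses their conventional definitions: weighted accuracy as the share of all utterances classified correctly, and equal error rate as the operating point where the false acceptance rate and false rejection rate meet. The function names, the threshold sweep, and the toy scores are assumptions made for this example and are not taken from the paper's code.

```python
from typing import Sequence


def weighted_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Overall accuracy: correctly classified utterances / all utterances."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def equal_error_rate(
    target_scores: Sequence[float], nontarget_scores: Sequence[float]
) -> float:
    """Approximate the EER by sweeping a threshold over the observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where the false acceptance rate (FAR) and false
    rejection rate (FRR) are closest, and reported as their average.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy data, made up for illustration (not from IEMOCAP or VoxCeleb1).
    print(weighted_accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

Evaluation toolkits typically interpolate between thresholds instead of picking the closest observed one, so values from this sketch can differ slightly from published figures.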

Related Papers

Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation (2025-07-21)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
Camera-based implicit mind reading by capturing higher-order semantic dynamics of human gaze within environmental context (2025-07-17)
A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)