Pushing the limits of raw waveform speaker recognition

Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

2022-03-16

Speaker Recognition, Speaker Verification, Self-Supervised Learning

Abstract

In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems is typically inferior to that of state-of-the-art counterparts based on handcrafted features, which achieve equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation. Our best model achieves an equal error rate of 0.89%, which is competitive with state-of-the-art models based on handcrafted features and outperforms the best existing model based on raw waveform inputs by a large margin. We also explore the application of the proposed model within a self-supervised learning framework. Our self-supervised model outperforms existing single-phase works in this line of research. Finally, we show that self-supervised pre-training is effective in the semi-supervised scenario, where only a small set of labelled training data is available alongside a larger set of unlabelled examples.
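
The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate equals the false rejection rate over a set of verification trials. The following is a minimal sketch of how such a figure can be computed from trial scores, assuming higher scores mean "same speaker". It is not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical and chosen only for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = int(np.sum(labels == 1))
    n_nontarget = int(np.sum(labels == 0))
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap = np.inf
    eer = 1.0
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        # False acceptance: a non-target trial that was accepted.
        far = np.sum(accepted & (labels == 0)) / n_nontarget
        # False rejection: a target trial that was rejected.
        frr = np.sum(~accepted & (labels == 1)) / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return float(eer)


if __name__ == "__main__":
    # Hypothetical toy trials: four target and four non-target scores.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
    print(f"EER = {equal_error_rate(toy_scores, toy_labels):.2%}")
```

Scanning the observed scores keeps the sketch dependency-light; evaluation toolkits usually interpolate the ROC curve instead, which can give slightly different values on small trial lists.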
