Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

Nithin Rao Koluguri, Taejin Park, Boris Ginsburg

2021-10-08 · Speaker Verification · Speaker Diarization
Paper · PDF · Code (official) · Code

Abstract

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer that maps variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, as well as on speaker diarization tasks, with diarization error rates (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results on diarization tasks.
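The pooling step described in the abstract — attention weights over time turning a variable-length feature sequence into a fixed-size embedding — can be sketched in plain NumPy. This is a simplified illustration only: the parameter shapes (`w`, `b`, `v`) and the exact attention form are assumptions for the sketch, not the TitaNet/NeMo implementation.

```python
import numpy as np

def attentive_stats_pooling(features, w, b, v):
    """Simplified channel-attention statistics pooling (illustrative sketch).

    features: (C, T) frame-level features, T varies per utterance
    w: (H, C), b: (H,), v: (C, H) -- hypothetical attention parameters
    Returns a fixed-length (2C,) embedding regardless of T.
    """
    h = np.tanh(w @ features + b[:, None])                 # (H, T) hidden scores
    scores = v @ h                                          # (C, T) per-channel scores
    alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax over time
    mu = (alpha * features).sum(axis=1)                     # attention-weighted mean, (C,)
    var = (alpha * features**2).sum(axis=1) - mu**2         # attention-weighted variance
    sigma = np.sqrt(np.clip(var, 1e-9, None))               # weighted std-dev, (C,)
    return np.concatenate([mu, sigma])                      # fixed-length (2C,) embedding

# Two utterances of different length map to embeddings of the same size.
rng = np.random.default_rng(0)
C, H = 4, 8
w, b, v = rng.normal(size=(H, C)), np.zeros(H), rng.normal(size=(C, H))
emb_a = attentive_stats_pooling(rng.normal(size=(C, 37)), w, b, v)
emb_b = attentive_stats_pooling(rng.normal(size=(C, 120)), w, b, v)
assert emb_a.shape == emb_b.shape == (2 * C,)
```

The point of the sketch is the invariance: whatever T is, the softmax over the time axis produces weights that reduce the sequence to per-channel mean and standard deviation, giving a fixed-length speaker embedding.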

Results

Task                 | Dataset        | Metric  | Value | Model
Speaker Diarization  | NIST-SRE 2000  | DER (%) | 5.73  | x-vector (MCGAN)
Speaker Diarization  | NIST-SRE 2000  | DER (%) | 6.37  | TitaNet-S (NME-SC)
Speaker Diarization  | NIST-SRE 2000  | DER (%) | 6.47  | TitaNet-M (NME-SC)
Speaker Diarization  | NIST-SRE 2000  | DER (%) | 6.73  | TitaNet-L (NME-SC)
Speaker Diarization  | NIST-SRE 2000  | DER (%) | 8.39  | x-vector (PLDA + AHC)
Speaker Diarization  | CH109          | DER (%) | 1.11  | TitaNet-S (NME-SC)
Speaker Diarization  | CH109          | DER (%) | 1.13  | TitaNet-M (NME-SC)
Speaker Diarization  | CH109          | DER (%) | 1.19  | TitaNet-L (NME-SC)
Speaker Diarization  | CH109          | DER (%) | 9.72  | x-vector (PLDA + AHC)
Speaker Diarization  | AMI MixHeadset | DER (%) | 1.73  | TitaNet-L (NME-SC)
Speaker Diarization  | AMI MixHeadset | DER (%) | 1.78  | ECAPA (SC)
Speaker Diarization  | AMI MixHeadset | DER (%) | 1.79  | TitaNet-M (NME-SC)
Speaker Diarization  | AMI MixHeadset | DER (%) | 2.22  | TitaNet-S (NME-SC)
Speaker Diarization  | AMI Lapel      | DER (%) | 1.99  | TitaNet-M (NME-SC)
Speaker Diarization  | AMI Lapel      | DER (%) | 2.00  | TitaNet-S (NME-SC)
Speaker Diarization  | AMI Lapel      | DER (%) | 2.03  | TitaNet-L (NME-SC)
Speaker Diarization  | AMI Lapel      | DER (%) | 2.36  | ECAPA (SC)
Speaker Verification | VoxCeleb1      | EER (%) | 0.68  | TitaNet-L
Speaker Verification | VoxCeleb1      | EER (%) | 0.81  | TitaNet-M
Speaker Verification | VoxCeleb1      | EER (%) | 1.15  | TitaNet-S

Related Papers

SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks (2025-07-17)
Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models (2025-06-23)
SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification (2025-06-21)
Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models (2025-06-17)
A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments (2025-06-17)
M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset (2025-06-17)
Exploring Speaker Diarization with Mixture of Experts (2025-06-17)
Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models (2025-06-16)