TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context

Nithin Rao Koluguri, Taejin Park, Boris Ginsburg

2021-10-08

Speaker Verification, Speaker Diarization

Abstract

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and on speaker diarization tasks, with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks.
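
The pooling step described above, which turns a variable number of frames into a fixed-length vector, can be sketched in plain Python. This is only an illustration of attention-weighted statistics pooling in general, not the paper's exact layer: the function name `attentive_stats_pooling` and the `scores` input are assumptions made here, and in a real model the scores would be produced by a small learned network rather than passed in.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool T frames of C channels into one vector of length 2 * C.

    frames: list of T frames, each a list of C feature values.
    scores: list of T frames of C unnormalised attention scores
            (hypothetical input; a real model would learn these).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means, stds = [], []
    for c in range(num_channels):
        # Softmax over time, computed separately for each channel.
        col = [scores[t][c] for t in range(num_frames)]
        peak = max(col)
        exps = [math.exp(s - peak) for s in col]
        total = sum(exps)
        weights = [e / total for e in exps]

        # Attention-weighted mean and standard deviation of the channel.
        mean = sum(w * frames[t][c] for t, w in enumerate(weights))
        second = sum(w * frames[t][c] ** 2 for t, w in enumerate(weights))
        var = max(second - mean ** 2, 1e-9)  # clamp for numerical safety
        means.append(mean)
        stds.append(math.sqrt(var))
    # Concatenating mean and std gives a size independent of T.
    return means + stds
```

Whatever the utterance length, the output has twice as many entries as there are channels, which is what lets a later linear layer produce a fixed-size embedding.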

Results

Task	Dataset	Metric	Value	Model
Speaker Diarization	NIST-SRE 2000	DER(%)	5.73	x-vector (MCGAN)
Speaker Diarization	NIST-SRE 2000	DER(%)	6.37	TitaNet-S (NME-SC)
Speaker Diarization	NIST-SRE 2000	DER(%)	6.47	TitaNet-M (NME-SC)
Speaker Diarization	NIST-SRE 2000	DER(%)	6.73	TitaNet-L (NME-SC)
Speaker Diarization	NIST-SRE 2000	DER(%)	8.39	x-vector (PLDA + AHC)
Speaker Diarization	CALLHOME-109	DER(%)	1.11	TitaNet-S
Speaker Diarization	CH109	DER(%)	1.11	TitaNet-S (NME-SC)
Speaker Diarization	CH109	DER(%)	1.13	TitaNet-M (NME-SC)
Speaker Diarization	CH109	DER(%)	1.19	TitaNet-L (NME-SC)
Speaker Diarization	CH109	DER(%)	9.72	x-vector (PLDA + AHC)
Speaker Diarization	AMI MixHeadset	DER(%)	1.73	TitaNet-L (NME-SC)
Speaker Diarization	AMI MixHeadset	DER(%)	1.78	ECAPA (SC)
Speaker Diarization	AMI MixHeadset	DER(%)	1.79	TitaNet-M (NME-SC)
Speaker Diarization	AMI MixHeadset	DER(%)	2.22	TitaNet-S (NME-SC)
Speaker Diarization	AMI Lapel	DER(%)	1.99	TitaNet-M (NME-SC)
Speaker Diarization	AMI Lapel	DER(%)	2.00	TitaNet-S (NME-SC)
Speaker Diarization	AMI Lapel	DER(%)	2.03	TitaNet-L (NME-SC)
Speaker Diarization	AMI Lapel	DER(%)	2.36	ECAPA (SC)
Speaker Verification	VoxCeleb	EER(%)	0.68	TitaNet-L
Speaker Verification	VoxCeleb	EER(%)	0.81	TitaNet-M
Speaker Verification	VoxCeleb	EER(%)	1.15	TitaNet-S
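
For readers unfamiliar with the EER metric in the verification rows, the sketch below shows one simple way to estimate it from trial scores. It assumes higher scores mean "same speaker" and sweeps only the observed scores as thresholds; the function name `equal_error_rate` is chosen here for illustration, and official scoring tools may interpolate differently, so small numerical differences are expected.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER as a fraction in [0, 1].

    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False rejection rate: same-speaker trials below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Multiplying the returned fraction by 100 gives a percentage comparable to the EER(%) column.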
