Many-Speakers Single Channel Speech Separation with Optimal Permutation Training

Shaked Dovrat, Eliya Nachmani, Lior Wolf

2021-04-18
Speech Separation

Abstract

Single-channel speech separation has made great progress in the last few years. However, training neural speech separation models for a large number of speakers (e.g., more than 10) is out of reach for current methods, which rely on Permutation Invariant Training (PIT). In this work, we present a permutation invariant training scheme that employs the Hungarian algorithm in order to train with $O(C^3)$ time complexity, where $C$ is the number of speakers, compared with the $O(C!)$ of PIT-based methods. Furthermore, we present a modified architecture that can handle the increased number of speakers. Our approach separates up to $20$ speakers and improves on the previous results for large $C$ by a wide margin.
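The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pairwise_cost`, `hungarian_assignment`, `brute_force_pit`), the toy signals, and the use of mean squared error as the pairwise loss are all assumptions made for illustration. The sketch builds a $C \times C$ cost matrix between estimated and target sources, solves the assignment with a standard $O(C^3)$ Hungarian routine, and checks the result against an exhaustive $O(C!)$ search of the kind used in conventional PIT.

```python
import itertools
import random


def pairwise_cost(estimates, targets):
    """cost[i][j] = MSE between estimate i and target j (MSE is an assumed stand-in loss)."""
    return [[sum((e - t) ** 2 for e, t in zip(est, tgt)) / len(tgt)
             for tgt in targets] for est in estimates]


def hungarian_assignment(cost):
    """Minimum-cost assignment for a square matrix in O(C^3).

    Returns a list a where a[i] is the target index matched to estimate i.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (n + 1)   # column potentials
    p = [0] * (n + 1)     # p[j] = row currently matched to column j (1-based)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def brute_force_pit(cost):
    """Exhaustive O(C!) search over all permutations, for comparison only."""
    n = len(cost)
    best = min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


# Toy data: C random "sources"; the estimates are the targets in shuffled
# order with a little added noise.
rng = random.Random(0)
C, T = 5, 16
targets = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(C)]
shuffle = list(range(C))
rng.shuffle(shuffle)
estimates = [[x + 0.01 * rng.gauss(0, 1) for x in targets[k]] for k in shuffle]

cost = pairwise_cost(estimates, targets)
assignment = hungarian_assignment(cost)
total_loss = sum(cost[i][assignment[i]] for i in range(C)) / C
```

In this toy setup the recovered `assignment` matches the shuffle that generated the estimates, and it agrees with `brute_force_pit(cost)`. The practical difference is in scaling: for $C = 20$ an exhaustive search would have to score $20! \approx 2.4 \times 10^{18}$ permutations, while the cubic bound is on the order of $20^3 = 8000$ basic steps.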

Results

Task	Dataset	Metric	Value (dB)	Model
Speech Separation	WSJ0-5mix	SI-SDRi	13.22	Hungarian PIT
Speech Separation	Libri5Mix	SI-SDRi	12.72	Hungarian PIT
Speech Separation	Libri10Mix	SI-SDRi	7.78	Hungarian PIT
Speech Separation	Libri15Mix	SI-SDRi	5.66	Hungarian PIT
Speech Separation	Libri20Mix	SI-SDRi	4.26	Hungarian PIT
