Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Younglo Lee, Shukjae Choi, Byeong-Yeol Kim, Zhong-Qiu Wang, Shinji Watanabe

2024-01-23 · Speech Separation · Speaker Separation

Paper · PDF

Abstract

We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2 and 3mix respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
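The attractor-and-conditioning step described above can be sketched in a few lines. This is an illustrative simplification, not the authors' code: a single cross-attention step stands in for the transformer decoder stack, and all dimensions, weight matrices, and the function name `tda_film_sketch` are assumptions. It shows the core data flow: learned speaker queries attend over the mixture embedding to produce one attractor per speaker, and each attractor then conditions the embedding via feature-wise linear modulation (FiLM), creating a speaker dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tda_film_sketch(mixture_emb, queries, W_gamma, W_beta):
    """Simplified stand-in for the TDA module (illustrative only).

    mixture_emb: (T, D) mixture embedding from the dual-path blocks
    queries:     (S, D) fixed, small set of learned speaker queries
    W_gamma, W_beta: (D, D) projections mapping attractors to FiLM params
    """
    # One cross-attention step in place of the transformer decoder:
    # each speaker query pools the mixture embedding into an attractor.
    attn = softmax(queries @ mixture_emb.T / np.sqrt(mixture_emb.shape[1]))  # (S, T)
    attractors = attn @ mixture_emb                                          # (S, D)
    # FiLM conditioning: per-speaker scale and shift broadcast over time,
    # turning the (T, D) embedding into (S, T, D) -- a new speaker axis
    # that the triple-path blocks can then process.
    gamma = attractors @ W_gamma                                             # (S, D)
    beta = attractors @ W_beta                                               # (S, D)
    conditioned = gamma[:, None, :] * mixture_emb[None] + beta[:, None, :]   # (S, T, D)
    return attractors, conditioned

rng = np.random.default_rng(0)
T, D, S = 50, 16, 3  # frames, embedding size, speaker queries (assumed sizes)
mix = rng.standard_normal((T, D))
attr, cond = tda_film_sketch(mix,
                             rng.standard_normal((S, D)),
                             rng.standard_normal((D, D)),
                             rng.standard_normal((D, D)))
```

In the paper the decoder is a full transformer (stacked self- and cross-attention layers) and the conditioned tensor is refined by the triple-path blocks, whose extra pathway runs along the speaker axis created here.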

Results

Task                Dataset     Metric    Value   Model
Speech Separation   WSJ0-2mix   SI-SDRi   24.0    SepTDA (L=12)
Speech Separation   WSJ0-3mix   SI-SDRi   23.7    SepTDA
Speech Separation   WSJ0-4mix   SI-SDRi   22.0    SepTDA
Speech Separation   WSJ0-5mix   SI-SDRi   21.0    SepTDA
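The metric in the table is SI-SDR improvement (SI-SDRi). For reference, a minimal implementation of the standard definition (not taken from the paper's code): SI-SDR projects the estimate onto the reference before comparing target and residual energy, making it invariant to rescaling, and SI-SDRi subtracts the SI-SDR obtained by using the unprocessed mixture as the estimate.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (standard definition)."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)  # optimal scaling of the reference
    target = alpha * ref                          # projection onto the reference
    noise = est - target                          # residual
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_sdri(estimate, reference, mixture):
    """Improvement over the unprocessed mixture, as reported in the table."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

So "24.0 dB SI-SDRi on WSJ0-2mix" means the separated output is, on average, 24.0 dB better by this measure than leaving the mixture untouched.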

Related Papers

Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios (2025-06-17)
SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition (2025-06-15)
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline (2025-05-25)
Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers (2025-05-22)
Single-Channel Target Speech Extraction Utilizing Distance and Room Clues (2025-05-20)
Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation (2025-05-19)
SepPrune: Structured Pruning for Efficient Deep Speech Separation (2025-05-17)