Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation

Yuzhou Liu, DeLiang Wang

2019-04-25 | Speech Separation, Clustering, Speaker Separation

Abstract

We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into two stages: simultaneous grouping and sequential grouping. Simultaneous grouping is performed first, in each time frame, by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, a clustering network sequentially groups the frame-level separated spectra into different speakers. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
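
The two stages described in the abstract can be illustrated with a minimal NumPy sketch for the two-speaker case. It is not the authors' implementation: the function names `frame_level_pit_loss` and `regroup`, the `(2, T, F)` array layout, and the L1 distance are assumptions made for illustration, and the per-frame labels come from the oracle permutation rather than from a trained clustering network.

```python
import numpy as np

def frame_level_pit_loss(est, ref):
    """Frame-level permutation-invariant loss for two speakers.

    est, ref: arrays of shape (2, T, F) with estimated and reference
    spectra (hypothetical layout). Returns the mean per-frame loss and a
    length-T label vector: 0 keeps the output order, 1 swaps the outputs.
    """
    # err[i, j, t]: L1 distance between output i and speaker j in frame t
    err = np.abs(est[:, None] - ref[None, :]).sum(axis=-1)
    keep = err[0, 0] + err[1, 1]   # output 0 -> speaker 0, output 1 -> speaker 1
    swap = err[0, 1] + err[1, 0]   # output 0 -> speaker 1, output 1 -> speaker 0
    assignment = (swap < keep).astype(int)
    loss = np.minimum(keep, swap).mean()
    return loss, assignment

def regroup(est, assignment):
    """Sequential grouping given per-frame labels: swap the two outputs
    in every frame labelled 1 so each stream follows one speaker."""
    out = est.copy()
    swap = assignment.astype(bool)
    out[0, swap] = est[1, swap]
    out[1, swap] = est[0, swap]
    return out

# Usage: outputs that are perfect per frame but swapped in frames 1 and 3
rng = np.random.default_rng(0)
ref = rng.random((2, 5, 4))
est = ref.copy()
est[:, [1, 3]] = ref[::-1][:, [1, 3]]
loss, labels = frame_level_pit_loss(est, ref)
streams = regroup(est, labels)
```

In this sketch the first function plays the role of simultaneous grouping (the loss ignores which output carries which speaker in a given frame), and the second plays the role of sequential grouping (frames are reassigned so that each output stream tracks a single speaker over time).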

Results

Task	Dataset	Metric	Value (dB)	Model
Speech Separation	WSJ0-2mix	SI-SDRi	17.7	DeepCASA
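
For readers unfamiliar with the metric, SI-SDRi is the improvement in scale-invariant signal-to-distortion ratio, in dB, of the separated signal over the unprocessed mixture. The sketch below follows the standard definition; the function names, the mean removal, and the `eps` stabilizer are illustrative choices, and it is not the evaluation script used to obtain the value in the table.

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between 1-D signals est and ref."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the scaled target
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

def si_sdr_improvement(est, mix, ref):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the mixture."""
    return si_sdr(est, ref) - si_sdr(mix, ref)

# Usage with synthetic signals (not WSJ0-2mix data)
rng = np.random.default_rng(0)
ref = np.sin(np.linspace(0, 200 * np.pi, 8000))
interferer = rng.standard_normal(8000)
mix = ref + interferer
est = ref + 0.1 * interferer
gain = si_sdr_improvement(est, mix, ref)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.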
