Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation

Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

2021-03-01

Speech Separation

Abstract

One of the leading single-channel speech separation (SS) models is based on a TasNet with a dual-path segmentation technique, where the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass shape, namely Sandglasset, which advances the state-of-the-art (SOTA) SS performance at a significantly smaller model size and computational cost. Moving forward through the blocks of Sandglasset, the temporal granularity of the features gradually becomes coarser until halfway through the network blocks, and then successively becomes finer towards the raw signal level. We also find that residual connections between features with the same granularity are critical for preserving information after passing through the bottleneck layer. Experiments show that Sandglasset, with only 2.3M parameters, achieves the best results on two benchmark SS datasets, WSJ0-2mix and WSJ0-3mix, where the SI-SNRi scores improve by an absolute 0.8 dB and 2.4 dB, respectively, compared to the prior SOTA results.
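The coarse-then-fine granularity pattern described in the abstract can be illustrated with a minimal sketch. The function names, the even block count, and the geometric downsampling base of 4 are assumptions made for illustration only; the abstract does not state the actual factors or how the residual connections are wired.

```python
def sandglass_schedule(num_blocks, base=4):
    """Return one hypothetical downsampling factor per block.

    Factors grow geometrically until the middle of the network and then
    mirror back, so the features are coarsest at the bottleneck.
    The base of 4 is an assumption, not a value taken from the abstract.
    """
    if num_blocks < 2 or num_blocks % 2 != 0:
        raise ValueError("this sketch assumes an even number of blocks >= 2")
    half = num_blocks // 2
    first_half = [base ** b for b in range(half)]
    return first_half + first_half[::-1]


def residual_pairs(schedule):
    """Pair each block in the first half with the mirrored block in the
    second half when both operate at the same granularity."""
    n = len(schedule)
    return [
        (i, n - 1 - i)
        for i in range(n // 2)
        if schedule[i] == schedule[n - 1 - i]
    ]


schedule = sandglass_schedule(6)
pairs = residual_pairs(schedule)
print("downsampling factor per block:", schedule)
print("same-granularity residual pairs:", pairs)
```

In this sketch, each pair marks where a residual connection could carry features past the bottleneck to the later block that works at the same temporal resolution.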

Results

Task	Dataset	Metric	Value (dB)	Model
Speech Separation	WSJ0-2mix	SI-SDRi	21	Sandglasset
Speech Separation	WSJ0-3mix	SI-SDRi	17.1	Sandglasset
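The abstract reports SI-SNRi while the table labels the metric SI-SDRi. As a rough illustration of how such an improvement score is usually computed, the sketch below uses the commonly cited scale-invariant definition: project the zero-mean estimate onto the zero-mean target, then compare the energy of the projection with the energy of the residual. The function names and the `eps` stabilizer are assumptions; the source does not give its evaluation code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (common definition)."""
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("estimate and target must be non-empty and equal in length")
    # Remove the mean from both signals.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def si_snr_improvement(estimate, mixture, target):
    """Improvement (the trailing 'i'): gain of the estimate over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals: the interference is orthogonal to the target.
target = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [t + i for t, i in zip(target, interference)]
estimate = [t + 0.1 * i for t, i in zip(target, interference)]
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target first, rescaling the estimate by a constant leaves the score unchanged, which is what makes the measure scale-invariant.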

Related Papers

Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios (2025-06-17)
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline (2025-05-25)
Attractor-Based Speech Separation of Multiple Utterances by Unknown Number of Speakers (2025-05-22)
Single-Channel Target Speech Extraction Utilizing Distance and Room Clues (2025-05-20)
Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation (2025-05-19)
SepPrune: Structured Pruning for Efficient Deep Speech Separation (2025-05-17)
A Survey of Deep Learning for Complex Speech Spectrograms (2025-05-13)