Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou
Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription, where a separation model extracts individual speaker signals from overlapped speech. In this paper, we use Transformer and Conformer in lieu of recurrent neural networks in the separation system, as we believe that capturing global information with self-attention is crucial for speech separation. Evaluated on the LibriCSS dataset, the Conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from a bi-directional LSTM (BLSTM) baseline in the utterance-wise evaluation and a 15.4% relative WER reduction in the continuous evaluation.
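The abstract's core argument is that self-attention captures global context across the whole input sequence, unlike a recurrent network that propagates state step by step. As a minimal, dependency-free sketch (not the paper's actual model, which uses multi-head attention with learned Q/K/V projections inside Conformer blocks), scaled dot-product self-attention can be written in pure Python; here the Q, K, V projections are taken as identity to keep the example short:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of feature
    vectors X (list of equal-length lists). Every output frame is a
    weighted average of ALL input frames, so each position sees the
    whole sequence in one step -- the 'global information' property.
    Identity Q/K/V projections are an illustrative simplification."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)
        # Attention-weighted sum of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```

For example, `self_attention([[1.0, 0.0], [1.0, 0.0]])` returns the input unchanged, since identical frames receive equal weights and their average reproduces each frame.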
| Task | Dataset | Overlap condition | WER (%) | Model |
|---|---|---|---|---|
| Speech Separation | LibriCSS | 0L | 5 | Conformer (large) |
| Speech Separation | LibriCSS | 0S | 5.4 | Conformer (large) |
| Speech Separation | LibriCSS | 10% | 7.5 | Conformer (large) |
| Speech Separation | LibriCSS | 20% | 10.7 | Conformer (large) |
| Speech Separation | LibriCSS | 30% | 13.8 | Conformer (large) |
| Speech Separation | LibriCSS | 40% | 17.1 | Conformer (large) |
| Speech Separation | LibriCSS | 0L | 5.4 | Conformer (base) |
| Speech Separation | LibriCSS | 0S | 5.6 | Conformer (base) |
| Speech Separation | LibriCSS | 10% | 8.2 | Conformer (base) |
| Speech Separation | LibriCSS | 20% | 11.8 | Conformer (base) |
| Speech Separation | LibriCSS | 30% | 15.5 | Conformer (base) |
| Speech Separation | LibriCSS | 40% | 18.9 | Conformer (base) |