Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou
Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription, where a separation model extracts individual speaker signals from overlapped speech. In this paper, we use Transformer and Conformer in lieu of recurrent neural networks in the separation system, as we believe that capturing global information with self-attention is crucial for speech separation. Evaluated on the LibriCSS dataset, the Conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from a bi-directional LSTM (BLSTM) baseline in the utterance-wise evaluation and a 15.4% relative WER reduction in the continuous evaluation.
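The abstract's core argument is that self-attention captures global context across the whole input sequence, unlike a recurrent network that propagates state step by step. As a minimal, dependency-free sketch (not the paper's actual model, which uses multi-head attention with learned Q/K/V projections inside Conformer blocks), scaled dot-product self-attention can be written in pure Python; here the Q, K, V projections are taken as identity to keep the example short:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of feature
    vectors X (list of equal-length lists). Every output frame is a
    weighted average of ALL input frames, so each position sees the
    whole sequence in one step -- the 'global information' property.
    Identity Q/K/V projections are an illustrative simplification."""
    d = len(X[0])
    out = []
    for q in X:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)
        # Attention-weighted sum of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```

For example, `self_attention([[1.0, 0.0], [1.0, 0.0]])` returns the input unchanged, since identical frames receive equal weights and their average reproduces each frame.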
| Task | Dataset | Overlap condition | WER (%) | Model |
|---|---|---|---|---|
| Speech Separation | LibriCSS | 0L | 5 | Conformer (large) |
| Speech Separation | LibriCSS | 0S | 5.4 | Conformer (large) |
| Speech Separation | LibriCSS | 10% | 7.5 | Conformer (large) |
| Speech Separation | LibriCSS | 20% | 10.7 | Conformer (large) |
| Speech Separation | LibriCSS | 30% | 13.8 | Conformer (large) |
| Speech Separation | LibriCSS | 40% | 17.1 | Conformer (large) |
| Speech Separation | LibriCSS | 0L | 5.4 | Conformer (base) |
| Speech Separation | LibriCSS | 0S | 5.6 | Conformer (base) |
| Speech Separation | LibriCSS | 10% | 8.2 | Conformer (base) |
| Speech Separation | LibriCSS | 20% | 11.8 | Conformer (base) |
| Speech Separation | LibriCSS | 30% | 15.5 | Conformer (base) |
| Speech Separation | LibriCSS | 40% | 18.9 | Conformer (base) |