TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

2024-08-06

Tasks: Speech Separation, Speech Enhancement

Abstract

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
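The block structure described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel size, the SiLU activation, the single attention head, the random weights, and all function names (`conv1d_same`, `conv_ffn`, `self_attention`, `block`, `init_params`) are assumptions chosen for illustration, and the novel normalization mentioned in the abstract is omitted entirely.

```python
import numpy as np

def conv1d_same(x, w):
    """Zero-padded 'same' 1-D convolution along the sequence axis.
    x: (T, C_in); w: (K, C_in, C_out) with odd kernel size K."""
    K, T = w.shape[0], x.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, w.shape[2]))
    for k in range(K):
        out += xp[k:k + T] @ w[k]
    return out

def conv_ffn(x, w_in, w_out):
    """FFN with convolutions in place of linear layers (local modeling)."""
    h = conv1d_same(x, w_in)
    h = h / (1.0 + np.exp(-h))  # SiLU, i.e. h * sigmoid(h); an assumed choice
    return conv1d_same(h, w_out)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention (global modeling)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def block(x, p):
    """Conv-FFN -> self-attention -> conv-FFN, each with a residual connection."""
    x = x + conv_ffn(x, p["ffn1_in"], p["ffn1_out"])
    x = x + self_attention(x, p["wq"], p["wk"], p["wv"])
    x = x + conv_ffn(x, p["ffn2_in"], p["ffn2_out"])
    return x

def init_params(dim, hidden, kernel, seed=0):
    """Small random weights, for illustration only."""
    rng = np.random.default_rng(seed)
    def w(*shape):
        return 0.1 * rng.standard_normal(shape)
    return {
        "ffn1_in": w(kernel, dim, hidden), "ffn1_out": w(kernel, hidden, dim),
        "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
        "ffn2_in": w(kernel, dim, hidden), "ffn2_out": w(kernel, hidden, dim),
    }

params = init_params(dim=8, hidden=16, kernel=3)
x = np.random.default_rng(1).standard_normal((20, 8))  # (sequence length, channels)
y = block(x, params)  # same shape as x
```

In a dual-path model, a block of this kind would be applied along the frequency axis within each frame and along the time axis within each frequency bin; the sketch processes a single sequence only. Perturbing one input position changes the output of `conv_ffn` only within a few neighbouring positions, while the output of `self_attention` changes at every position, which is the division of labour the abstract describes.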

Results

Task	Dataset	Metric	Value	Model
Speech Separation	WHAMR!	Number of parameters (M)	15	TF-Locoformer (M)
Speech Separation	WHAMR!	SDRi	16.9	TF-Locoformer (M)
Speech Separation	WHAMR!	SI-SDRi	18.5	TF-Locoformer (M)
Speech Separation	WHAMR!	Number of parameters (M)	5	TF-Locoformer (S)
Speech Separation	WHAMR!	SDRi	15.9	TF-Locoformer (S)
Speech Separation	WHAMR!	SI-SDRi	17.4	TF-Locoformer (S)
Speech Separation	WSJ0-2mix	Number of parameters (M)	22.5	TF-Locoformer (L) + DM
Speech Separation	WSJ0-2mix	SDRi	25.2	TF-Locoformer (L) + DM
Speech Separation	WSJ0-2mix	SI-SDRi	25.1	TF-Locoformer (L) + DM
Speech Separation	WSJ0-2mix	Number of parameters (M)	15	TF-Locoformer (M) + DM
Speech Separation	WSJ0-2mix	SDRi	24.7	TF-Locoformer (M) + DM
Speech Separation	WSJ0-2mix	SI-SDRi	24.6	TF-Locoformer (M) + DM
Speech Separation	WSJ0-2mix	Number of parameters (M)	22.5	TF-Locoformer (L)
Speech Separation	WSJ0-2mix	SDRi	24.3	TF-Locoformer (L)
Speech Separation	WSJ0-2mix	SI-SDRi	24.2	TF-Locoformer (L)
Speech Separation	WSJ0-2mix	Number of parameters (M)	15	TF-Locoformer (M)
Speech Separation	WSJ0-2mix	SDRi	23.8	TF-Locoformer (M)
Speech Separation	WSJ0-2mix	SI-SDRi	23.6	TF-Locoformer (M)
Speech Separation	WSJ0-2mix	Number of parameters (M)	5	TF-Locoformer (S) + DM
Speech Separation	WSJ0-2mix	SDRi	23	TF-Locoformer (S) + DM
Speech Separation	WSJ0-2mix	SI-SDRi	22.8	TF-Locoformer (S) + DM
Speech Separation	WSJ0-2mix	Number of parameters (M)	5	TF-Locoformer (S)
Speech Separation	WSJ0-2mix	SDRi	22.1	TF-Locoformer (S)
Speech Separation	WSJ0-2mix	SI-SDRi	22	TF-Locoformer (S)
Speech Separation	Libri2Mix	Number of parameters (M)	15	TF-Locoformer (M)
Speech Separation	Libri2Mix	SDRi	22.2	TF-Locoformer (M)
Speech Separation	Libri2Mix	SI-SDRi	22.1	TF-Locoformer (M)
Speech Enhancement	Deep Noise Suppression (DNS) Challenge	FLOPS (G)	497.24	TF-Locoformer (M)
Speech Enhancement	Deep Noise Suppression (DNS) Challenge	Number of parameters (M)	15	TF-Locoformer (M)
Speech Enhancement	Deep Noise Suppression (DNS) Challenge	PESQ-WB	3.72	TF-Locoformer (M)
Speech Enhancement	Deep Noise Suppression (DNS) Challenge	SI-SDR-WB	23.3	TF-Locoformer (M)
Speech Enhancement	Deep Noise Suppression (DNS) Challenge	STOI	98.8	TF-Locoformer (M)
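
The SI-SDRi values above are improvements in scale-invariant signal-to-distortion ratio relative to the unprocessed mixture. As a reading aid, the following is a minimal pure-Python sketch of the standard SI-SDR definition and the corresponding improvement. It is not the evaluation code used to produce the table; the function names and toy signals are hypothetical, and the mean removal that evaluation toolkits commonly apply is omitted.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length sequences."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                   # optimal scaling of the reference
    target = [alpha * r for r in reference]    # scaled reference component
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(target_energy / noise_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals (hypothetical): the interference is orthogonal to the reference.
reference = [1.0, 0.0, -1.0, 0.0]
interference = [0.0, 1.0, 0.0, -1.0]
mixture = [r + i for r, i in zip(reference, interference)]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

improvement = si_sdr_improvement(estimate, mixture, reference)  # about 20 dB
```

In this toy case the estimate attenuates the interference amplitude by a factor of 10, which corresponds to an improvement of about 20 dB. Rescaling the estimate by any nonzero constant leaves `si_sdr` unchanged, which is what "scale-invariant" refers to.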
