Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux
Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. Some previous state-of-the-art (SoTA) models rely on RNNs, and therefore lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, this work focuses on removing the RNN from TF-domain dual-path models while maintaining SoTA performance. We present TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs, one before and one after self-attention, to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA on multiple benchmarks with an RNN-free architecture.
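The block layout described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the SiLU activation, the kernel size of 3, the identity query/key/value projections, and the absence of normalization are all assumptions made for illustration, and the sketch runs along a single sequence axis rather than the full dual-path time and frequency arrangement.

```python
import math
import random

def conv1d_same(x, w):
    """Zero-padded 'same' 1-D convolution over frames.
    x: list of T frames, each a list of C_in floats.
    w: kernel indexed as w[k][c_in][c_out], with odd length K."""
    T, K, c_out = len(x), len(w), len(w[0][0])
    pad = K // 2
    y = []
    for t in range(T):
        frame = [0.0] * c_out
        for k in range(K):
            s = t + k - pad
            if 0 <= s < T:  # frames outside the sequence act as zeros
                for i, xi in enumerate(x[s]):
                    for o in range(c_out):
                        frame[o] += xi * w[k][i][o]
        y.append(frame)
    return y

def silu(v):
    # Assumed activation; the abstract does not name one.
    return v / (1.0 + math.exp(-v))

def conv_ffn(x, w_in, w_out):
    """FFN whose two linear layers are replaced by convolutions (local modeling)."""
    hidden = [[silu(v) for v in frame] for frame in conv1d_same(x, w_in)]
    y = conv1d_same(hidden, w_out)
    # Residual connection around the FFN.
    return [[a + b for a, b in zip(fx, fy)] for fx, fy in zip(x, y)]

def self_attention(x):
    """Single-head scaled dot-product attention (global modeling).
    Query, key, and value projections are the identity here, a simplification."""
    scale = 1.0 / math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        context = [sum(wt * v[c] for wt, v in zip(weights, x)) / total
                   for c in range(len(q))]
        out.append([a + b for a, b in zip(q, context)])  # residual connection
    return out

def locoformer_like_block(x, params):
    """Conv-FFN, then self-attention, then a second Conv-FFN."""
    x = conv_ffn(x, params["w1_in"], params["w1_out"])
    x = self_attention(x)
    x = conv_ffn(x, params["w2_in"], params["w2_out"])
    return x

def random_kernel(rng, k, c_in, c_out):
    return [[[rng.uniform(-0.5, 0.5) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(k)]

rng = random.Random(0)
T, C, H, K = 8, 4, 6, 3  # frames, channels, hidden channels, kernel size
params = {
    "w1_in": random_kernel(rng, K, C, H), "w1_out": random_kernel(rng, K, H, C),
    "w2_in": random_kernel(rng, K, H and C, H), "w2_out": random_kernel(rng, K, H, C),
}
x = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(T)]
y = locoformer_like_block(x, params)  # same shape as x: T frames of C channels
```

In this sketch, each convolutional FFN only sees frames within two steps of the current one (two stacked kernels of size 3), while the attention step mixes information across all frames. That split mirrors the division of labor the abstract describes.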
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Separation | WHAMR! | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | SDRi | 16.9 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | SI-SDRi | 18.5 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | Number of parameters (M) | 5 | TF-Locoformer (S) |
| Speech Separation | WHAMR! | SDRi | 15.9 | TF-Locoformer (S) |
| Speech Separation | WHAMR! | SI-SDRi | 17.4 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 22.5 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 25.2 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 25.1 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 15 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 24.7 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 24.6 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 22.5 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | SDRi | 24.3 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 24.2 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | SDRi | 23.8 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 23.6 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 5 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 23.0 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 22.8 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 5 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | SDRi | 22.1 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 22.0 | TF-Locoformer (S) |
| Speech Separation | Libri2Mix | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | Libri2Mix | SDRi | 22.2 | TF-Locoformer (M) |
| Speech Separation | Libri2Mix | SI-SDRi | 22.1 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | FLOPS (G) | 497.24 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | PESQ-WB | 3.72 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | SI-SDR-WB | 23.3 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | STOI | 98.8 | TF-Locoformer (M) |
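
The SI-SDRi rows above report the improvement in scale-invariant signal-to-distortion ratio (SI-SDR) of the separated output over the unprocessed mixture, in dB. The sketch below uses the common definition of SI-SDR. The function names and the toy signals are illustrative, and the evaluation code behind the table may differ in details such as mean removal.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    target_energy = sum(t * t for t in target)
    residual_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    return 10.0 * math.log10(target_energy / residual_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals: two sinusoids that are orthogonal over the analysis window.
n = 400
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
interferer = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
mixture = [r + v for r, v in zip(reference, interferer)]
# Hypothetical separator output that attenuates the interferer by a factor of 10.
estimate = [r + 0.1 * v for r, v in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
```

With these toy signals, the mixture sits at about 0 dB SI-SDR and the estimate at about 20 dB, so `improvement` comes out close to 20 dB. Rescaling `estimate` by any nonzero constant leaves the score unchanged, which is the scale invariance the metric's name refers to.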