Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

2024-08-06 · Speech Separation · Speech Enhancement

Paper · PDF · Code (official)

Abstract

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.
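The core idea in the abstract — self-attention sandwiched between two feed-forward networks that use convolutions instead of linear layers — can be sketched as follows. This is a minimal illustration of the described block structure, not the authors' released code; all class names, kernel sizes, and the 0.5 macaron scaling are assumptions for the sketch, and the paper's novel normalization is approximated here with standard LayerNorm.

```python
# Hedged sketch of a TF-Locoformer-style block: two convolutional FFNs
# placed before and after self-attention (macaron-style), so convolutions
# capture local context and attention captures global patterns.
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """FFN with 1-D convolutions in place of linear layers, giving the
    feed-forward path a local receptive field along the sequence axis."""
    def __init__(self, dim, hidden, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size, padding=pad),
            nn.SiLU(),
            nn.Conv1d(hidden, dim, kernel_size, padding=pad),
        )

    def forward(self, x):  # x: (batch, seq, dim)
        # Conv1d expects (batch, dim, seq), so transpose around the conv stack.
        return self.net(x.transpose(1, 2)).transpose(1, 2)

class LocoformerBlock(nn.Module):
    def __init__(self, dim=64, heads=4, hidden=256):
        super().__init__()
        self.ffn1 = ConvFFN(dim, hidden)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn2 = ConvFFN(dim, hidden)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq, dim)
        x = x + 0.5 * self.ffn1(self.norm1(x))   # local modeling before attention
        h = self.norm2(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global modeling
        x = x + 0.5 * self.ffn2(self.norm3(x))   # local modeling after attention
        return x
```

In the actual dual-path model, a block like this would be applied alternately along the time and frequency axes of the TF representation; the sketch shows a single axis for clarity.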

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Separation | WHAMR! | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | SDRi | 16.9 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | SI-SDRi | 18.5 | TF-Locoformer (M) |
| Speech Separation | WHAMR! | Number of parameters (M) | 5 | TF-Locoformer (S) |
| Speech Separation | WHAMR! | SDRi | 15.9 | TF-Locoformer (S) |
| Speech Separation | WHAMR! | SI-SDRi | 17.4 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 22.5 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 25.2 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 25.1 | TF-Locoformer (L) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 15 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 24.7 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 24.6 | TF-Locoformer (M) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 22.5 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | SDRi | 24.3 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 24.2 | TF-Locoformer (L) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | SDRi | 23.8 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 23.6 | TF-Locoformer (M) |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 5 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | SDRi | 23.0 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | SI-SDRi | 22.8 | TF-Locoformer (S) + DM |
| Speech Separation | WSJ0-2mix | Number of parameters (M) | 5 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | SDRi | 22.1 | TF-Locoformer (S) |
| Speech Separation | WSJ0-2mix | SI-SDRi | 22.0 | TF-Locoformer (S) |
| Speech Separation | Libri2Mix | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Separation | Libri2Mix | SDRi | 22.2 | TF-Locoformer (M) |
| Speech Separation | Libri2Mix | SI-SDRi | 22.1 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | FLOPS (G) | 497.24 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | Number of parameters (M) | 15 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | PESQ-WB | 3.72 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | SI-SDR-WB | 23.3 | TF-Locoformer (M) |
| Speech Enhancement | Deep Noise Suppression (DNS) Challenge | STOI | 98.8 | TF-Locoformer (M) |
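The SI-SDRi rows above report scale-invariant SDR *improvement*: the SI-SDR of the separated estimate minus the SI-SDR of the unprocessed mixture, both measured against the clean reference. A minimal NumPy sketch of the metric (the function names here are illustrative, not from the paper's codebase):

```python
# SI-SDR: project the estimate onto the reference (optimal scaling),
# then compare target energy to residual energy in dB. Scale-invariance
# comes from the projection step.
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and a reference."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled reference component
    noise = estimate - target           # everything not explained by it
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

def si_sdri(estimate, mixture, reference):
    """SI-SDR improvement: estimate's SI-SDR minus the mixture's."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because of the projection, any rescaled copy of the reference scores near-infinite SI-SDR, which is why the metric is preferred over plain SNR for separation systems that do not preserve absolute scale.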

Related Papers

- Autoregressive Speech Enhancement via Acoustic Tokens (2025-07-17)
- P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge (2025-07-15)
- Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
- Robust One-step Speech Enhancement via Consistency Distillation (2025-07-08)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)
- MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
- Frequency-Weighted Training Losses for Phoneme-Level DNN-based Speech Enhancement (2025-06-23)
- EDNet: A Distortion-Agnostic Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training (2025-06-19)