Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
While attention-based architectures such as Conformers excel at speech enhancement, they scale poorly with input sequence length. In contrast, the recently proposed Extended Long Short-Term Memory (xLSTM) architecture offers linear scalability; however, xLSTM-based models remain unexplored for speech enhancement. This paper introduces xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A comparative analysis reveals that xLSTM (and, notably, even LSTM) can match or outperform state-of-the-art Mamba- and Conformer-based systems across various model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through ablation studies, we identify key architectural design choices, such as exponential gating and bidirectionality, that contribute to its effectiveness. Our best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and Conformer-based systems of similar complexity on the VoiceBank+DEMAND dataset.
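The exponential gating identified in the ablation replaces the sigmoid input/forget gates of a classic LSTM with exponentials, which requires a log-domain stabilizer to avoid overflow. The following is a minimal scalar sketch of one such recurrent step in the style of xLSTM's sLSTM cell; the function name is hypothetical, the output gate and matrix-memory (mLSTM) variant are omitted, and this is not the paper's implementation:

```python
import math

def exp_gated_step(c_prev, n_prev, m_prev, i_pre, f_pre, z):
    """One recurrent step with exponential gating and log-domain stabilization.

    Hypothetical illustrative function, not the xLSTM-SENet code.
    i_pre, f_pre: pre-activations of the input/forget gates; z: cell input.
    """
    # Raw gates exp(i_pre) and exp(f_pre) can overflow, so track the running
    # maximum m_t = max(f_pre + m_{t-1}, i_pre) and rescale both by exp(-m_t).
    m = max(f_pre + m_prev, i_pre)
    i = math.exp(i_pre - m)           # stabilized input gate
    f = math.exp(f_pre + m_prev - m)  # stabilized forget gate
    c = f * c_prev + i * z            # cell state update
    n = f * n_prev + i                # normalizer state (keeps h well-scaled)
    h = c / n                         # normalized hidden output (output gate omitted)
    return c, n, m, h
```

Because both gates are divided by the same running maximum, the cell/normalizer ratio `h` stays finite even for very large gate pre-activations, which is what makes the exponential gates trainable in practice. Bidirectionality, the other ablated choice, simply runs such a cell over the sequence in both directions and combines the outputs.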
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Enhancement | VoiceBank + DEMAND | CBAK | 3.98 | xLSTM-SENet2 |
| Speech Enhancement | VoiceBank + DEMAND | COVL | 4.27 | xLSTM-SENet2 |
| Speech Enhancement | VoiceBank + DEMAND | CSIG | 4.78 | xLSTM-SENet2 |
| Speech Enhancement | VoiceBank + DEMAND | PESQ (WB) | 3.53 | xLSTM-SENet2 |
| Speech Enhancement | VoiceBank + DEMAND | Params (M) | 2.27 | xLSTM-SENet2 |
| Speech Enhancement | VoiceBank + DEMAND | STOI | 0.96 | xLSTM-SENet2 |