Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

2025-07-01 · Speech Enhancement · Automatic Speech Recognition
Paper · PDF · Code (official)

Abstract

With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.
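The core architectural idea, reusing one set of attention weights along both the time and frequency axes of a spectrogram-like representation, can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions (single head, no projections back to the input, illustrative class and variable names), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """One set of attention weights applied along BOTH the time and
    frequency axes (hypothetical single-head simplification of the
    paper's shared time- and frequency-multi-head attention)."""

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    def attend(self, x):
        # x: (seq, d) -> scaled dot-product self-attention over seq
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return scores @ v

    def __call__(self, spec):
        # spec: (T, F, d) time-frequency feature tensor
        T, F, _ = spec.shape
        # Time attention: attend over T for each frequency bin.
        t_out = np.stack([self.attend(spec[:, f]) for f in range(F)], axis=1)
        # Frequency attention with the SAME weights: attend over F per frame.
        return np.stack([self.attend(t_out[t]) for t in range(T)], axis=0)

x = np.random.default_rng(1).standard_normal((10, 8, 16))
y = SharedAttention(16)(x)
print(y.shape)  # (10, 8, 16)
```

The point of sharing is that the same query/key/value weights must produce useful attention patterns in both views, which, per the ablation studies, appears to act as a regularizer that helps out-of-domain generalization.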

Results

Speech Enhancement on VB-DemandEx (in-domain)

| Model         | Params (M) | PESQ (wb) | ESTOI | SI-SDR | SSNR  |
|---------------|------------|-----------|-------|--------|-------|
| MambAttention | 2.33       | 3.026     | 0.801 | 16.684 | 7.674 |
| SEMamba       | 2.25       | 3.002     | 0.800 | 16.593 | 7.590 |
| xLSTM-SENet   | 2.20       | 2.973     | 0.795 | 16.414 | 7.933 |
| MP-SENet      | 2.05       | 2.935     | 0.787 | 16.202 | 7.641 |

Speech Enhancement on the Deep Noise Suppression (DNS) Challenge (out-of-domain)

| Model         | Params (M) | PESQ-WB | ESTOI (%) | SI-SDR-WB | SSNR   |
|---------------|------------|---------|-----------|-----------|--------|
| MambAttention | 2.33       | 3.671   | 95.9      | 21.234    | 15.116 |
| xLSTM-SENet   | 2.20       | 3.588   | 95.4      | 20.854    | 14.526 |
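Of the metrics reported above, SI-SDR (scale-invariant signal-to-distortion ratio) has a simple closed form: rescale the clean target by the least-squares optimal gain, then measure the energy ratio between the scaled target and the residual in dB. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between an enhanced signal and the
    clean target, using the standard optimal-scaling definition."""
    alpha = np.dot(estimate, target) / np.dot(target, target)
    scaled = alpha * target          # target projected onto the estimate's scale
    residual = estimate - scaled     # everything not explained by the target
    return 10 * np.log10(np.dot(scaled, scaled) / np.dot(residual, residual))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of "clean speech" stand-in
noisy = clean + 0.1 * rng.standard_normal(16000)
print(si_sdr(noisy, clean))  # roughly 20 dB for this 0.1 noise scale
```

Because of the optimal rescaling, the metric is invariant to the overall gain of the estimate, which is why it is preferred over plain SDR for enhancement systems whose output level is arbitrary.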

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Autoregressive Speech Enhancement via Acoustic Tokens (2025-07-17)
- P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge (2025-07-15)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Robust One-step Speech Enhancement via Consistency Distillation (2025-07-08)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)