Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

2025-07-01 · Speech Enhancement · Automatic Speech Recognition
Paper · PDF · Code (official)

Abstract

With the advent of new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform state-of-the-art models in single-channel speech enhancement, automatic speech recognition, and self-supervised audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this issue, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VoiceBank+Demand Extended (VB-DemandEx), a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 and EARS-WHAM_v2, while matching their performance on the in-domain dataset VB-DemandEx. Ablation studies highlight the role of weight sharing between the time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. However, our MambAttention model remains superior on both out-of-domain datasets across all reported evaluation metrics.
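The core architectural idea, reusing one set of attention weights along both the time and frequency axes of a spectrogram-like representation, can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions (single head, no projections back to the input, illustrative class and variable names), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttention:
    """One set of attention weights applied along BOTH the time and
    frequency axes (hypothetical single-head simplification of the
    paper's shared time- and frequency-multi-head attention)."""

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    def attend(self, x):
        # x: (seq, d) -> scaled dot-product self-attention over seq
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return scores @ v

    def __call__(self, spec):
        # spec: (T, F, d) time-frequency feature tensor
        T, F, _ = spec.shape
        # Time attention: attend over T for each frequency bin.
        t_out = np.stack([self.attend(spec[:, f]) for f in range(F)], axis=1)
        # Frequency attention with the SAME weights: attend over F per frame.
        return np.stack([self.attend(t_out[t]) for t in range(T)], axis=0)

x = np.random.default_rng(1).standard_normal((10, 8, 16))
y = SharedAttention(16)(x)
print(y.shape)  # (10, 8, 16)
```

The point of sharing is that the same query/key/value weights must produce useful attention patterns in both views, which, per the ablation studies, appears to act as a regularizer that helps out-of-domain generalization.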

Results

Speech Enhancement on VB-DemandEx (in-domain)

| Model         | Params (M) | PESQ (wb) | ESTOI | SI-SDR | SSNR  |
|---------------|------------|-----------|-------|--------|-------|
| MambAttention | 2.33       | 3.026     | 0.801 | 16.684 | 7.674 |
| SEMamba       | 2.25       | 3.002     | 0.800 | 16.593 | 7.590 |
| xLSTM-SENet   | 2.20       | 2.973     | 0.795 | 16.414 | 7.933 |
| MP-SENet      | 2.05       | 2.935     | 0.787 | 16.202 | 7.641 |

Speech Enhancement on the Deep Noise Suppression (DNS) Challenge (out-of-domain)

| Model         | Params (M) | PESQ-WB | ESTOI (%) | SI-SDR-WB | SSNR   |
|---------------|------------|---------|-----------|-----------|--------|
| MambAttention | 2.33       | 3.671   | 95.9      | 21.234    | 15.116 |
| xLSTM-SENet   | 2.20       | 3.588   | 95.4      | 20.854    | 14.526 |
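Of the metrics reported above, SI-SDR (scale-invariant signal-to-distortion ratio) has a simple closed form: rescale the clean target by the least-squares optimal gain, then measure the energy ratio between the scaled target and the residual in dB. A minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between an enhanced signal and the
    clean target, using the standard optimal-scaling definition."""
    alpha = np.dot(estimate, target) / np.dot(target, target)
    scaled = alpha * target          # target projected onto the estimate's scale
    residual = estimate - scaled     # everything not explained by the target
    return 10 * np.log10(np.dot(scaled, scaled) / np.dot(residual, residual))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of "clean speech" stand-in
noisy = clean + 0.1 * rng.standard_normal(16000)
print(si_sdr(noisy, clean))  # roughly 20 dB for this 0.1 noise scale
```

Because of the optimal rescaling, the metric is invariant to the overall gain of the estimate, which is why it is preferred over plain SDR for enhancement systems whose output level is arbitrary.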

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Autoregressive Speech Enhancement via Acoustic Tokens (2025-07-17)
- P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge (2025-07-15)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Robust One-step Speech Enhancement via Consistency Distillation (2025-07-08)
- Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis (2025-07-08)