CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

Sherif Abdulatif, Ruizhe Cao, Bin Yang

2022-09-22

Speech Recognition, Denoising, Automatic Speech Recognition (ASR), Super-Resolution, Audio Super-Resolution, Speech Separation, Speech Enhancement, Speech Denoising


Abstract

In this work, we further develop the conformer-based metric generative adversarial network (CMGAN) model for speech enhancement (SE) in the time-frequency (TF) domain. This paper builds on our previous work but takes a more in-depth look by conducting extensive ablation studies on model inputs and architectural design choices. We rigorously tested the generalization ability of the model to unseen noise types and distortions. We have fortified our claims through DNS-MOS measurements and listening tests. Rather than focusing exclusively on the speech denoising task, we extend this work to address the dereverberation and super-resolution tasks. This necessitated exploring various architectural changes, specifically metric discriminator scores and masking techniques. It is essential to highlight that this is among the earliest works that attempted complex TF-domain super-resolution. Our findings show that CMGAN outperforms existing state-of-the-art methods in the three major speech enhancement tasks: denoising, dereverberation, and super-resolution. For example, in the denoising task using the Voice Bank+DEMAND dataset, CMGAN notably exceeded the performance of prior models, attaining a PESQ score of 3.41 and an SSNR of 11.10 dB. Audio samples and CMGAN implementations are available online.

Results

Task	Dataset	Metric	Value	Model
Audio Generation	VCTK Multi-Speaker	Log-Spectral Distance	0.76	CMGAN
Speech Enhancement	VoiceBank + DEMAND	CBAK	3.94	CMGAN
Speech Enhancement	VoiceBank + DEMAND	COVL	4.12	CMGAN
Speech Enhancement	VoiceBank + DEMAND	CSIG	4.63	CMGAN
Speech Enhancement	VoiceBank + DEMAND	PESQ (wb)	3.41	CMGAN
Speech Enhancement	VoiceBank + DEMAND	SSNR (dB)	11.10	CMGAN
Speech Enhancement	VoiceBank + DEMAND	STOI (%)	96	CMGAN
Audio Super-Resolution	VCTK Multi-Speaker	Log-Spectral Distance	0.76	CMGAN
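
Two of the metrics in the table, SSNR and log-spectral distance, are simple enough to sketch in a few lines. The code below is a minimal illustration under commonly used definitions, not the authors' evaluation code: the function names, the frame length and hop, the [-10, 35] dB clamp, and the `eps` guard are all assumptions chosen for the example. `log_spectral_distance` expects magnitude spectrograms as lists of frames and leaves the STFT step to the caller.

```python
import math


def segmental_snr(clean, enhanced, frame_len=512, hop=256, lo=-10.0, hi=35.0):
    """Mean frame-wise SNR in dB, with each frame clamped to [lo, hi].

    The frame size, hop, and clamp range are assumed defaults, not values
    taken from the paper.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have equal length")
    eps = 1e-10  # guards against division by zero and log of zero
    scores = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = sum(x * x for x in c)
        noise = sum((x - y) ** 2 for x, y in zip(c, e))
        snr_db = 10.0 * math.log10(signal / (noise + eps) + eps)
        scores.append(min(max(snr_db, lo), hi))
    if not scores:
        raise ValueError("signal shorter than one frame")
    return sum(scores) / len(scores)


def log_spectral_distance(ref_mag, est_mag):
    """Mean over frames of the RMS (over frequency) log-power difference.

    ref_mag and est_mag are magnitude spectrograms given as lists of frames,
    each frame a list of non-negative magnitudes per frequency bin.
    """
    eps = 1e-10
    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        diffs = [
            (math.log10(r * r + eps) - math.log10(s * s + eps)) ** 2
            for r, s in zip(ref_frame, est_frame)
        ]
        per_frame.append(math.sqrt(sum(diffs) / len(diffs)))
    return sum(per_frame) / len(per_frame)
```

Higher SSNR is better, while lower log-spectral distance is better. Absolute values depend on the framing and clamping choices, so numbers from this sketch are not directly comparable to the table unless the evaluation settings match.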
