Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion

Samuel Pegg, Kai Li, Xiaolin Hu

2024-01-25 · Speech Recognition · Speech Separation

Paper · PDF · Code (official)

Abstract

Audio-visual speech separation has gained significant traction in recent years due to its potential applications in various fields such as speech recognition, diarization, scene analysis and assistive technologies. Designing a lightweight audio-visual speech separation network is important for low-latency applications, but existing methods often require higher computational costs and more parameters to achieve better separation performance. In this paper, we present an audio-visual speech separation model called Top-Down-Fusion Net (TDFNet), a state-of-the-art (SOTA) model for audio-visual speech separation, which builds upon the architecture of TDANet, an audio-only speech separation method. TDANet serves as the architectural foundation for the auditory and visual networks within TDFNet, offering an efficient model with fewer parameters. On the LRS2-2Mix dataset, TDFNet achieves a performance increase of up to 10% across all performance metrics compared with the previous SOTA method CTCNet. Remarkably, these results are achieved using fewer parameters and only 28% of the multiply-accumulate operations (MACs) of CTCNet. In essence, our method presents a highly effective and efficient solution to the challenges of speech separation within the audio-visual domain, making significant strides in harnessing visual information optimally.

Results

| Task              | Dataset | Metric  | Value | Model                 |
|-------------------|---------|---------|-------|-----------------------|
| Speech Separation | LRS2    | PESQ    | 3.21  | TDFNet-large          |
| Speech Separation | LRS2    | SDRi    | 15.9  | TDFNet-large          |
| Speech Separation | LRS2    | SI-SNRi | 15.8  | TDFNet-large          |
| Speech Separation | LRS2    | STOI    | 0.949 | TDFNet-large          |
| Speech Separation | LRS2    | PESQ    | 3.16  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | SDRi    | 15.2  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | SI-SNRi | 15.0  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | STOI    | 0.938 | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | PESQ    | 3.1   | TDFNet-small          |
| Speech Separation | LRS2    | SDRi    | 13.7  | TDFNet-small          |
| Speech Separation | LRS2    | SI-SNRi | 13.6  | TDFNet-small          |
| Speech Separation | LRS2    | STOI    | 0.931 | TDFNet-small          |
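The SI-SNRi and SDRi values above are improvement metrics: the score of the separated estimate minus the score of the unprocessed mixture, in dB. As a reference for how SI-SNRi is conventionally computed (a minimal NumPy sketch of the standard definition, not the authors' evaluation code; the `eps` guard is an assumption for numerical stability):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between an estimate and a reference."""
    # Zero-mean both signals before projecting
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the input mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Scale invariance means multiplying the estimate by any nonzero gain leaves the score unchanged, which is why SI-SNR is preferred over plain SNR for separation benchmarks; SDRi is computed analogously from the (non-scale-invariant) signal-to-distortion ratio.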

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
- MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)