Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down Fusion

Samuel Pegg, Kai Li, Xiaolin Hu

2024-01-25 · Speech Recognition · Speech Separation

Paper · PDF · Code (official)

Abstract

Audio-visual speech separation has gained significant traction in recent years due to its potential applications in various fields such as speech recognition, diarization, scene analysis and assistive technologies. Designing a lightweight audio-visual speech separation network is important for low-latency applications, but existing methods often require higher computational costs and more parameters to achieve better separation performance. In this paper, we present an audio-visual speech separation model called Top-Down-Fusion Net (TDFNet), a state-of-the-art (SOTA) model for audio-visual speech separation, which builds upon the architecture of TDANet, an audio-only speech separation method. TDANet serves as the architectural foundation for the auditory and visual networks within TDFNet, offering an efficient model with fewer parameters. On the LRS2-2Mix dataset, TDFNet achieves a performance increase of up to 10% across all performance metrics compared with the previous SOTA method CTCNet. Remarkably, these results are achieved using fewer parameters and only 28% of the multiply-accumulate operations (MACs) of CTCNet. In essence, our method presents a highly effective and efficient solution to the challenges of speech separation within the audio-visual domain, making significant strides in harnessing visual information optimally.

Results

| Task              | Dataset | Metric  | Value | Model                 |
|-------------------|---------|---------|-------|-----------------------|
| Speech Separation | LRS2    | PESQ    | 3.21  | TDFNet-large          |
| Speech Separation | LRS2    | SDRi    | 15.9  | TDFNet-large          |
| Speech Separation | LRS2    | SI-SNRi | 15.8  | TDFNet-large          |
| Speech Separation | LRS2    | STOI    | 0.949 | TDFNet-large          |
| Speech Separation | LRS2    | PESQ    | 3.16  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | SDRi    | 15.2  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | SI-SNRi | 15.0  | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | STOI    | 0.938 | TDFNet (MHSA + Shared)|
| Speech Separation | LRS2    | PESQ    | 3.1   | TDFNet-small          |
| Speech Separation | LRS2    | SDRi    | 13.7  | TDFNet-small          |
| Speech Separation | LRS2    | SI-SNRi | 13.6  | TDFNet-small          |
| Speech Separation | LRS2    | STOI    | 0.931 | TDFNet-small          |
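The SI-SNRi and SDRi values above are improvement metrics: the score of the separated estimate minus the score of the unprocessed mixture, in dB. As a reference for how SI-SNRi is conventionally computed (a minimal NumPy sketch of the standard definition, not the authors' evaluation code; the `eps` guard is an assumption for numerical stability):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between an estimate and a reference."""
    # Zero-mean both signals before projecting
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the input mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Scale invariance means multiplying the estimate by any nonzero gain leaves the score unchanged, which is why SI-SNR is preferred over plain SNR for separation benchmarks; SDRi is computed analogously from the (non-scale-invariant) signal-to-distortion ratio.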

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
- MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)