Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation

Samuel Pegg, Kai Li, Xiaolin Hu

2023-09-29 · Speech Recognition · Speech Separation · Audio-Visual Speech Recognition

Paper · PDF · Code (official)

Abstract

Audio-visual speech separation methods aim to integrate different modalities to generate high-quality separated speech, thereby enhancing the performance of downstream tasks such as speech recognition. Most existing state-of-the-art (SOTA) models operate in the time domain. However, their overly simplistic approach to modeling acoustic features often necessitates larger and more computationally intensive models in order to achieve SOTA performance. In this paper, we present a novel time-frequency domain audio-visual speech separation method: Recurrent Time-Frequency Separation Network (RTFS-Net), which applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform. We model and capture the time and frequency dimensions of the audio independently using a multi-layered RNN along each dimension. Furthermore, we introduce a unique attention-based fusion technique for the efficient integration of audio and visual information, and a new mask separation approach that takes advantage of the intrinsic spectral nature of the acoustic features for a clearer separation. RTFS-Net outperforms the prior SOTA method in both inference speed and separation quality while reducing the number of parameters by 90% and MACs by 83%. This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
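The core modeling idea in the abstract (processing the time and frequency dimensions of an STFT representation independently with an RNN along each axis) can be sketched in a few lines of NumPy. This is a minimal illustration of the dual-path scanning pattern, not the paper's architecture: the simple tanh recurrence, the weight matrices, and the function names here are illustrative placeholders.

```python
import numpy as np

def simple_rnn(x, W, U):
    """Toy recurrence h_t = tanh(x_t W + h_{t-1} U) over a (steps, features) sequence."""
    h = np.zeros(U.shape[0])
    out = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ W + h @ U)
        out.append(h)
    return np.stack(out)  # (steps, hidden)

def dual_path_scan(tf, Wt, Ut, Wf, Uf):
    """Scan a (T, F, C) time-frequency feature map along each axis independently.

    tf : (T, F, C) features, e.g. derived from complex STFT bins.
    First pass: an RNN runs along the time axis, once per frequency bin.
    Second pass: an RNN runs along the frequency axis, once per time frame.
    """
    T, F, C = tf.shape
    # Time path: each frequency bin is treated as an independent sequence over T.
    time_out = np.stack([simple_rnn(tf[:, f, :], Wt, Ut) for f in range(F)], axis=1)
    # Frequency path: each time frame is treated as an independent sequence over F.
    freq_out = np.stack([simple_rnn(time_out[t], Wf, Uf) for t in range(T)], axis=0)
    return freq_out  # (T, F, hidden_f)
```

Factorizing the 2-D scan this way keeps each RNN short (length T or F rather than T×F), which is one reason time-frequency dual-path designs can be parameter-efficient.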

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Separation | LRS3 | SDRi | 17.6 | RTFS-Net-12 |
| Speech Separation | LRS3 | SI-SNRi | 17.5 | RTFS-Net-12 |
| Speech Separation | LRS3 | SDRi | 17.1 | RTFS-Net-6 |
| Speech Separation | LRS3 | SI-SNRi | 16.9 | RTFS-Net-6 |
| Speech Separation | LRS3 | SDRi | 15.6 | RTFS-Net-4 |
| Speech Separation | LRS3 | SI-SNRi | 15.5 | RTFS-Net-4 |
| Speech Separation | LRS2 | SDRi | 15.1 | RTFS-Net-12 |
| Speech Separation | LRS2 | SI-SNRi | 14.9 | RTFS-Net-12 |
| Speech Separation | LRS2 | SDRi | 14.8 | RTFS-Net-6 |
| Speech Separation | LRS2 | SI-SNRi | 14.6 | RTFS-Net-6 |
| Speech Separation | LRS2 | SDRi | 14.3 | RTFS-Net-4 |
| Speech Separation | LRS2 | SI-SNRi | 14.1 | RTFS-Net-4 |
| Speech Separation | VoxCeleb2 | SDRi | 13.6 | RTFS-Net-12 |
| Speech Separation | VoxCeleb2 | SI-SNRi | 12.4 | RTFS-Net-12 |
| Speech Separation | VoxCeleb2 | SDRi | 12.8 | RTFS-Net-6 |
| Speech Separation | VoxCeleb2 | SI-SNRi | 11.8 | RTFS-Net-6 |
| Speech Separation | VoxCeleb2 | SDRi | 12.4 | RTFS-Net-4 |
| Speech Separation | VoxCeleb2 | SI-SNRi | 11.5 | RTFS-Net-4 |
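The SI-SNRi values above are improvements in scale-invariant signal-to-noise ratio: SI-SNR of the separated estimate minus SI-SNR of the unprocessed mixture, both measured against the clean reference. A minimal NumPy sketch of the standard metric (function names are ours, not from the paper's codebase):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto ref, compare target vs residual energy."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref  # projection of est onto ref
    e_noise = est - s_target                          # residual orthogonal to ref
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the raw mixture, in dB."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Because of the projection step, the metric is invariant to rescaling the estimate, which is why it is preferred over plain SNR for separation benchmarks.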

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
- Dynamic Slimmable Networks for Efficient Speech Separation (2025-07-08)
- A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
- First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
- MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)