Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Real-Time Target Sound Extraction

Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

2022-11-04 | Streaming Target Sound Extraction | Target Sound Extraction

Abstract

We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
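
To illustrate the encoder idea described in the abstract, here is a minimal pure-Python sketch of a stack of dilated causal 1-D convolutions. It is not the Waveformer implementation: the kernel size, the dilation schedule `[1, 2, 4, 8]`, the shared weights, and the function names are all hypothetical choices made for illustration. The sketch shows two properties the abstract relies on: each output sample depends only on current and past input samples (which is what permits streaming), and the receptive field grows with the sum of the dilations while the number of multiplications per layer stays fixed.

```python
def causal_dilated_conv(x, weights, dilation):
    """One causal dilated convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero (left padding), so no output
    ever depends on a future input sample.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def run_stack(x, weights, dilations):
    """Apply one layer per dilation (weights shared across layers for brevity)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical configuration, chosen only for illustration.
DILATIONS = [1, 2, 4, 8]
WEIGHTS = [1.0, 1.0, 1.0]  # kernel size 3

impulse = [1.0] + [0.0] * 39
response = run_stack(impulse, WEIGHTS, DILATIONS)
print(receptive_field(len(WEIGHTS), DILATIONS))  # 31
```

With exponentially growing dilations, four layers of kernel size 3 cover 31 input samples, whereas four undilated layers would cover only 9. The impulse response above is non-zero up to index 30 and zero afterwards, which matches the computed receptive field.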

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Source Separation | FSDSoundScapes | SI-SNRi (dB) | 9.43 | Waveformer |
| Target Sound Extraction | FSDSoundScapes | SI-SNRi (dB) | 9.43 | Waveformer |
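
For readers unfamiliar with the metric, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the model output over the unprocessed mixture, both measured against the clean target. The sketch below uses the commonly cited definition of SI-SNR (zero-mean signals, projection of the estimate onto the target). This page does not say which evaluation code produced the value above, so treat the function names and the `eps` stabiliser as assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference target."""
    n = len(target)
    mean_t = sum(target) / n
    mean_e = sum(estimate) / n
    t = [v - mean_t for v in target]
    e = [v - mean_e for v in estimate]

    # Project the estimate onto the target to isolate the "signal" part.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(v * v for v in t) + eps)
    signal = [scale * v for v in t]
    noise = [a - b for a, b in zip(e, signal)]

    signal_energy = sum(v * v for v in signal)
    noise_energy = sum(v * v for v in noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the input mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Synthetic example: the "model output" keeps only 10% of the interference.
target = [math.sin(0.10 * i) for i in range(200)]
interference = [math.sin(0.37 * i + 1.0) for i in range(200)]
mixture = [t + v for t, v in zip(target, interference)]
estimate = [t + 0.1 * v for t, v in zip(target, interference)]

improvement = si_snr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by a constant leaves the score unchanged, which is why the metric is called scale-invariant.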

Related Papers

SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction (2025-05-30)
Leveraging Audio-Only Data for Text-Queried Target Sound Extraction (2024-09-20)
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues (2024-09-19)
Language-Queried Target Sound Extraction Without Parallel Training Data (2024-09-14)
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer (2024-09-12)
Cross-attention Inspired Selective State Space Models for Target Sound Extraction (2024-09-07)
Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance? (2024-07-22)
CATSE: A Context-Aware Framework for Causal Target Sound Extraction (2024-03-21)