
CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

Hao Ma, Zhiyuan Peng, Xu Li, Mingjie Shao, Xixin Wu, Ju Liu

2024-02-27 · Target Sound Extraction
Paper · PDF · Code (official)

Abstract

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch; as a consequence, substantial data and computational resources are required before the randomly initialized model can comprehend sound events and separate them accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address this issue. Specifically, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking positive and/or negative user prompts in one or more modalities for target sound extraction. These key features not only enhance extraction performance but also broaden the applicability of the model. We provide extensive experiments on five diverse datasets to demonstrate the superior performance and the zero- and few-shot generalizability of CLAPSep with fast training convergence, surpassing previous methods by a significant margin. The full code and some audio examples are released for reproduction and evaluation.
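The abstract describes a two-part design: a query network maps a prompt to a conditional embedding, and a separation network uses that embedding to mask the target out of the mixture. The sketch below is a minimal, hypothetical illustration of that pattern in PyTorch, not the authors' implementation; the FiLM-style conditioning, the `fuse_queries` rule, and `EMB_DIM` are all assumptions standing in for CLAP's encoders and CLAPSep's actual separator.

```python
# Minimal sketch of query-conditioned target sound extraction (TSE).
# Hypothetical stand-in for CLAPSep: the real model adapts a pre-trained
# CLAP encoder and a far larger separation network.
import torch
import torch.nn as nn

EMB_DIM = 512  # CLAP-style embedding size (assumption)

class Separator(nn.Module):
    """Mask-based separator conditioned on a query embedding via FiLM."""
    def __init__(self, n_freq: int = 257):
        super().__init__()
        self.film = nn.Linear(EMB_DIM, 2 * n_freq)    # per-bin scale & shift
        self.net = nn.Sequential(
            nn.Linear(n_freq, n_freq), nn.ReLU(),
            nn.Linear(n_freq, n_freq), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, mix_spec: torch.Tensor, query_emb: torch.Tensor):
        # mix_spec: (batch, time, freq) magnitude spectrogram
        scale, shift = self.film(query_emb).chunk(2, dim=-1)
        conditioned = mix_spec * scale.unsqueeze(1) + shift.unsqueeze(1)
        mask = self.net(conditioned)
        return mix_spec * mask  # estimated target spectrogram

# Positive and negative prompts are both embedded; one simple way to fuse
# them (an assumption, not the paper's exact scheme) is subtraction.
def fuse_queries(pos_emb: torch.Tensor, neg_emb: torch.Tensor) -> torch.Tensor:
    return pos_emb - neg_emb

sep = Separator()
mix = torch.rand(1, 100, 257)   # dummy mixture spectrogram
pos = torch.randn(1, EMB_DIM)   # e.g. embedding of "dog barking"
neg = torch.randn(1, EMB_DIM)   # e.g. embedding of "rain"
target_spec = sep(mix, fuse_queries(pos, neg))
print(target_spec.shape)        # torch.Size([1, 100, 257])
```

Because the query network is frozen or lightly adapted from a pre-trained CLAP, only the (comparatively small) separator must learn from scratch, which is what the abstract credits for the fast training convergence.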

Results

| Task | Dataset | Metric | Value | Model |
|------|---------|--------|-------|-------|
| Audio Source Separation | AudioSet | SDRi | 9.29 | CLAPSep |
| Audio Source Separation | AudioSet | SI-SDRi | 8.44 | CLAPSep |
| Audio Source Separation | AudioCaps | SDRi | 10.08 | CLAPSep |
| Audio Source Separation | AudioCaps | SI-SDRi | 9.4 | CLAPSep |
| Target Sound Extraction | AudioSet | SDRi | 9.29 | CLAPSep |
| Target Sound Extraction | AudioSet | SI-SDRi | 8.44 | CLAPSep |
| Target Sound Extraction | AudioCaps | SDRi | 10.08 | CLAPSep |
| Target Sound Extraction | AudioCaps | SI-SDRi | 9.4 | CLAPSep |
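Both metrics report improvement over the unprocessed mixture in dB, so higher is better: SDRi is the gain in signal-to-distortion ratio, and SI-SDRi is its scale-invariant variant. A self-contained sketch of SI-SDRi using the standard definitions (Le Roux et al., 2019), not code from the paper:

```python
# SI-SDR and SI-SDRi, following the standard definitions
# (Le Roux et al., "SDR -- half-baked or well done?", 2019).
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB; inputs are 1-D waveforms."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to remove any scaling.
    alpha = np.dot(estimate, target) / np.dot(target, target)
    s_target = alpha * target
    noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / np.dot(noise, noise))

def si_sdri(estimate: np.ndarray, target: np.ndarray, mixture: np.ndarray) -> float:
    """Improvement of the estimate over the raw mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = target + interference
estimate = target + 0.3 * interference   # partially cleaned estimate
print(f"SI-SDRi: {si_sdri(estimate, target, mixture):.2f} dB")
```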

Related Papers

SoundSculpt: Direction and Semantics Driven Ambisonic Target Sound Extraction (2025-05-30)
Leveraging Audio-Only Data for Text-Queried Target Sound Extraction (2024-09-20)
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues (2024-09-19)
Language-Queried Target Sound Extraction Without Parallel Training Data (2024-09-14)
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer (2024-09-12)
Cross-attention Inspired Selective State Space Models for Target Sound Extraction (2024-09-07)
Can all variations within the unified mask-based beamformer framework achieve identical peak extraction performance? (2024-07-22)
CATSE: A Context-Aware Framework for Causal Target Sound Extraction (2024-03-21)