Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Hybrid Transformers for Music Source Separation

Simon Rouard, Francisco Massa, Alexandre Défossez

2022-11-15 | Speech Enhancement | Music Source Separation

Abstract

A natural question arising in Music Source Separation (MSS) is whether long-range contextual information is useful, or whether local acoustic features are sufficient. In other fields, attention-based Transformers have shown their ability to integrate information over long sequences. In this work, we introduce Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains. While it performs poorly when trained only on MUSDB, we show that it outperforms Hybrid Demucs (trained on the same data) by 0.45 dB of SDR when using 800 extra training songs. Using sparse attention kernels to extend its receptive field, and per-source fine-tuning, we achieve state-of-the-art results on MUSDB with extra training data, with 9.20 dB of SDR.
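The abstract's "self-attention within one domain, and cross-attention across domains" can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, normalization, or positional encodings, and it folds self- and cross-attention into one hypothetical `cross_domain_layer`. The names `time_tokens` and `spec_tokens` are illustrative stand-ins for the temporal and spectral branch features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query attends to every key; the output row is the
    attention-weighted average of the value vectors.
    """
    dim = len(keys[0])
    width = len(values[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return out


def add(a, b):
    """Element-wise sum of two equally shaped token lists (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cross_domain_layer(time_tokens, spec_tokens):
    """Hypothetical layer: self-attention per domain, then cross-attention.

    Assumes both domains use the same feature width, so outputs of
    cross-attention can be added back to the querying domain.
    """
    # Self-attention: each domain attends to its own tokens.
    t = add(time_tokens, attention(time_tokens, time_tokens, time_tokens))
    s = add(spec_tokens, attention(spec_tokens, spec_tokens, spec_tokens))
    # Cross-attention: queries come from one domain, keys/values from the other.
    t_out = add(t, attention(t, s, s))
    s_out = add(s, attention(s, t, t))
    return t_out, s_out


# Toy inputs: 3 temporal tokens and 4 spectral tokens, feature width 2.
time_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spec_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
t_out, s_out = cross_domain_layer(time_tokens, spec_tokens)
print(len(t_out), len(s_out))
```

Because the two branches can have different sequence lengths, cross-attention is a natural way to exchange information: the output always has one row per query token, whatever the length of the other domain.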

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Music Source Separation | MUSDB18 | SDR (avg) | 9.2 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18 | SDR (bass) | 10.47 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18 | SDR (drums) | 10.83 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18 | SDR (other) | 6.41 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18 | SDR (vocals) | 9.37 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18 | SDR (avg) | 9 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18 | SDR (bass) | 9.78 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18 | SDR (drums) | 10.08 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18 | SDR (other) | 6.42 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18 | SDR (vocals) | 9.2 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18-HQ | SDR (avg) | 9.2 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18-HQ | SDR (bass) | 10.47 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18-HQ | SDR (drums) | 10.83 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18-HQ | SDR (others) | 6.41 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18-HQ | SDR (vocals) | 9.37 | Sparse HT Demucs (fine tuned) |
| Music Source Separation | MUSDB18-HQ | SDR (avg) | 9 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18-HQ | SDR (bass) | 10.39 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18-HQ | SDR (drums) | 10.08 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18-HQ | SDR (others) | 6.32 | Hybrid Transformer Demucs (f.t.) |
| Music Source Separation | MUSDB18-HQ | SDR (vocals) | 9.2 | Hybrid Transformer Demucs (f.t.) |
| Speech Enhancement | EARS-WHAM | DNSMOS | 3.66 | Demucs v4 |
| Speech Enhancement | EARS-WHAM | ESTOI | 0.71 | Demucs v4 |
| Speech Enhancement | EARS-WHAM | PESQ-WB | 2.37 | Demucs v4 |
| Speech Enhancement | EARS-WHAM | POLQA | 2.97 | Demucs v4 |
| Speech Enhancement | EARS-WHAM | SI-SDR | 16.92 | Demucs v4 |
| Speech Enhancement | EARS-WHAM | SIGMOS | 2.87 | Demucs v4 |
| 2D Classification | MUSDB18 | SDR (avg) | 9.2 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18 | SDR (bass) | 10.47 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18 | SDR (drums) | 10.83 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18 | SDR (other) | 6.41 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18 | SDR (vocals) | 9.37 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18 | SDR (avg) | 9 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18 | SDR (bass) | 9.78 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18 | SDR (drums) | 10.08 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18 | SDR (other) | 6.42 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18 | SDR (vocals) | 9.2 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18-HQ | SDR (avg) | 9.2 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18-HQ | SDR (bass) | 10.47 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18-HQ | SDR (drums) | 10.83 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18-HQ | SDR (others) | 6.41 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18-HQ | SDR (vocals) | 9.37 | Sparse HT Demucs (fine tuned) |
| 2D Classification | MUSDB18-HQ | SDR (avg) | 9 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18-HQ | SDR (bass) | 10.39 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18-HQ | SDR (drums) | 10.08 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18-HQ | SDR (others) | 6.32 | Hybrid Transformer Demucs (f.t.) |
| 2D Classification | MUSDB18-HQ | SDR (vocals) | 9.2 | Hybrid Transformer Demucs (f.t.) |
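
The SDR figures above are in decibels. As a rough guide to what such a number means, here is a minimal sketch of the plain energy-ratio definition of SDR for one reference signal and one estimate. It is an assumption-laden illustration, not the benchmark's evaluation procedure: published MUSDB scores are normally produced by a dedicated evaluation toolkit that adds windowing and aggregation over chunks and tracks, none of which is modeled here. The function name `sdr_db` is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Plain definition over two equal-length sample lists; no windowing,
    no aggregation across chunks or tracks.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf  # silent reference with a non-silent estimate
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy example: the estimate has 10% amplitude error on every non-zero sample,
# so the error energy is 1% of the signal energy (about 20 dB).
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(sdr_db(reference, estimate), 2))
```

Since the scale is logarithmic, a fixed dB difference between two models corresponds to a fixed multiplicative change in the signal-to-error energy ratio, regardless of the absolute level.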
