Junyu Chen, Susmitha Vekkot, Pancham Shukla
Music source separation (MSS) aims to extract the 'vocals', 'drums', 'bass', and 'other' stems from a mixed music recording. While deep learning methods have shown impressive results, state-of-the-art models have trended toward ever-larger sizes. In this paper, we introduce DTTNet, a novel lightweight architecture based on the Dual-Path Module and the Time-Frequency Convolutions Time-Distributed Fully-connected UNet (TFC-TDF UNet). DTTNet achieves 10.12 dB cSDR on 'vocals', compared to the 10.01 dB reported for Bandsplit RNN (BSRNN), with 86.7% fewer parameters. We also assess pattern-specific performance and model generalization on intricate audio patterns.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Music Source Separation | MUSDB18-HQ | SDR (avg) | 8.15 | Dual-Path TFC-TDF UNet (DTTNet) |
| Music Source Separation | MUSDB18-HQ | SDR (bass) | 7.55 | Dual-Path TFC-TDF UNet (DTTNet) |
| Music Source Separation | MUSDB18-HQ | SDR (drums) | 7.82 | Dual-Path TFC-TDF UNet (DTTNet) |
| Music Source Separation | MUSDB18-HQ | SDR (other) | 7.02 | Dual-Path TFC-TDF UNet (DTTNet) |
| Music Source Separation | MUSDB18-HQ | SDR (vocals) | 10.21 | Dual-Path TFC-TDF UNet (DTTNet) |
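The SDR values above measure how closely a separated stem matches the ground-truth stem. As a minimal sketch, the basic signal-to-distortion ratio is 10·log10 of the ratio between reference energy and residual-error energy; note that the paper's cSDR follows the SiSEC/museval evaluation protocol (median over song chunks), which this simplified version does not reproduce:

```python
import math

def sdr_db(reference, estimate):
    """Basic Signal-to-Distortion Ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2),
    where s is the reference stem and s_hat the separated estimate."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10 * math.log10(signal_power / error_power)

# Toy example: reference [3, 4] has energy 25; an estimate off by 1 in one
# sample leaves error energy 1, so SDR = 10 * log10(25) ~= 13.98 dB.
print(round(sdr_db([3.0, 4.0], [3.0, 5.0]), 2))  # -> 13.98
```

A perfect estimate drives the error energy to zero (infinite SDR), so higher values, like the 10.21 dB reported for 'vocals', indicate cleaner separation.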