Tingle Li, Jia-Wei Chen, Haowen Hou, Ming Li
Convolutional Neural Network (CNN)- and Long Short-Term Memory (LSTM)-based models that take spectrograms or waveforms as input are commonly used for deep-learning-based audio source separation. In this paper, we propose a Sliced Attention-based neural network (Sams-Net) in the spectrogram domain for the music source separation task. It enables spectral feature interactions through a multi-head attention mechanism, allows easier parallel computing than LSTMs, and has a larger receptive field than CNNs. Experimental results on the MUSDB18 dataset show that the proposed method outperforms most state-of-the-art DNN-based methods while using fewer parameters.
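The abstract does not spell out the attention details, so the following is only a rough illustration, not the authors' architecture: a minimal numpy sketch of multi-head self-attention applied across the time frames of a spectrogram, with random placeholder weights standing in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Toy multi-head self-attention over spectrogram frames.

    X: (T, F) magnitude spectrogram -- T time frames, F frequency bins.
    Weights are random placeholders; a trained model would learn them.
    """
    T, F = X.shape
    assert F % num_heads == 0
    d = F // num_heads  # per-head feature dimension
    Wq, Wk, Wv = (rng.standard_normal((F, F)) / np.sqrt(F) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        q, k, v = (M[:, h * d:(h + 1) * d] for M in (Q, K, V))
        # (T, T) weights: every frame attends to every other frame,
        # which is what gives attention its large receptive field
        scores = softmax(q @ k.T / np.sqrt(d))
        heads.append(scores @ v)
    return np.concatenate(heads, axis=1)  # (T, F)

rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((100, 64)))  # fake 100-frame, 64-bin spectrogram
out = multi_head_attention(spec, num_heads=4, rng=rng)
print(out.shape)  # (100, 64)
```

Because every frame attends to all others in a single matrix product, the heads can be computed in parallel, in contrast to the step-by-step recurrence of an LSTM.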
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Music Source Separation | MUSDB18 | SDR (avg) | 5.65 | Sams-Net |
| Music Source Separation | MUSDB18 | SDR (bass) | 5.25 | Sams-Net |
| Music Source Separation | MUSDB18 | SDR (drums) | 6.63 | Sams-Net |
| Music Source Separation | MUSDB18 | SDR (other) | 4.09 | Sams-Net |
| Music Source Separation | MUSDB18 | SDR (vocals) | 6.61 | Sams-Net |