Nathanaël Carraz Rakotonirina
Convolutions operate only locally and therefore fail to model global interactions. Self-attention, by contrast, can learn representations that capture long-range dependencies in sequences. We propose a network architecture for audio super-resolution that combines convolution and self-attention. Attention-based Feature-Wise Linear Modulation (AFiLM) uses a self-attention mechanism instead of recurrent neural networks to modulate the activations of the convolutional model. Extensive experiments show that our model outperforms existing approaches on standard benchmarks. Moreover, it allows for more parallelization, resulting in significantly faster training.
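
To make the modulation idea concrete, here is a minimal NumPy sketch of a feature-wise linear modulation step driven by self-attention. It is an illustration under stated assumptions, not the authors' implementation: the block-wise max pooling, the single attention head, the scale-and-shift form, and every name in it (`afilm_sketch`, `n_blocks`, the `params` dictionary) are hypothetical choices, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the rows of x, shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v


def afilm_sketch(feats, n_blocks, params):
    """Modulate convolutional activations of shape (T, C) block by block.

    Assumptions (not taken from the abstract): activations are split into
    n_blocks equal blocks along time, each block is summarized by max pooling,
    and attention over the block summaries yields one scale and one shift per
    block and channel.
    """
    t, c = feats.shape
    assert t % n_blocks == 0, "T must be divisible by n_blocks"
    blocks = feats.reshape(n_blocks, t // n_blocks, c)
    pooled = blocks.max(axis=1)                       # (n_blocks, C)
    ctx = self_attention(pooled, params["w_q"], params["w_k"], params["w_v"])
    gamma = ctx @ params["w_gamma"]                   # per-block scale
    beta = ctx @ params["w_beta"]                     # per-block shift
    out = gamma[:, None, :] * blocks + beta[:, None, :]
    return out.reshape(t, c)


rng = np.random.default_rng(0)
T, C, B = 32, 8, 4
names = ("w_q", "w_k", "w_v", "w_gamma", "w_beta")
params = {name: rng.normal(scale=0.1, size=(C, C)) for name in names}
feats = rng.normal(size=(T, C))
out = afilm_sketch(feats, B, params)
print(out.shape)  # (32, 8)
```

Because attention over the block summaries is a handful of matrix products, all blocks are processed at once, whereas a recurrent network would have to step through them in order. This is the property the abstract credits for faster training.
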
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Generation | Piano | Log-Spectral Distance | 1.5 | U-Net + AFiLM |
| Audio Generation | VCTK Multi-Speaker | Log-Spectral Distance | 1.7 | U-Net + AFiLM |
| Audio Generation | Voice Bank corpus (VCTK) | Log-Spectral Distance | 2.3 | U-Net + AFiLM |
| Audio Super-Resolution | Piano | Log-Spectral Distance | 1.5 | U-Net + AFiLM |
| Audio Super-Resolution | VCTK Multi-Speaker | Log-Spectral Distance | 1.7 | U-Net + AFiLM |
| Audio Super-Resolution | Voice Bank corpus (VCTK) | Log-Spectral Distance | 2.3 | U-Net + AFiLM |
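
The table reports Log-Spectral Distance (lower is better) but does not define it. The sketch below uses the definition commonly found in audio super-resolution work: the root-mean-square difference between log power spectra over frequency, averaged over frames. The frame length, hop size, Hann window, and the `eps` floor are assumptions chosen for illustration and may differ from the settings behind the reported values.

```python
import numpy as np


def log_power_spectrogram(x, n_fft=2048, hop=512, eps=1e-10):
    """Log10 power spectrogram of a 1-D signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log10(power + eps)  # eps avoids log(0) on silent bins


def log_spectral_distance(reference, estimate, **stft_kwargs):
    """RMS log-spectral difference over frequency, averaged over frames."""
    ref = log_power_spectrogram(reference, **stft_kwargs)
    est = log_power_spectrogram(estimate, **stft_kwargs)
    return float(np.mean(np.sqrt(np.mean((ref - est) ** 2, axis=1))))


# Toy signals: a 440 Hz tone and the same tone with an extra 3 kHz component.
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
degraded = clean + 0.1 * np.sin(2 * np.pi * 3000 * t)
print(log_spectral_distance(clean, clean))     # 0.0
print(log_spectral_distance(clean, degraded))  # a positive value
```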