Jan Schlüter, Gerald Gutenbrunner
In audio classification, differentiable auditory filterbanks with few parameters cover the middle ground between hard-coded spectrograms and raw audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with Per-Channel Energy Normalization (PCEN), has shown promising results but is computationally expensive. By using inhomogeneous convolution kernel sizes and strides, and by replacing PCEN with operations that parallelize better, we reach similar results more efficiently. In experiments on six audio classification tasks, our frontend matches the accuracy of LEAF at 3% of the cost, but both fail to consistently outperform a fixed mel filterbank. The quest for learnable audio frontends is not over.
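
To illustrate why PCEN is harder to parallelize than pointwise operations, here is a minimal NumPy sketch of the standard PCEN formulation (a first-order recursive smoother over time followed by adaptive gain control and root compression). It is not the LEAF or EfficientLEAF implementation: the function name `pcen`, the fixed scalar parameters, and their default values are illustrative assumptions, whereas a learnable frontend would typically train such parameters per channel.

```python
import numpy as np

def pcen(energy, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-6):
    """Sketch of Per-Channel Energy Normalization.

    energy: non-negative array of shape (channels, frames).
    s, alpha, delta, r, eps: illustrative values (assumptions), not
    taken from the paper.
    """
    energy = np.asarray(energy, dtype=np.float64)
    smoothed = np.empty_like(energy)
    smoothed[:, 0] = energy[:, 0]  # initialise the smoother with the first frame
    # Each frame depends on the previous smoothed frame, so this loop
    # is sequential along the time axis.
    for t in range(1, energy.shape[1]):
        smoothed[:, t] = (1.0 - s) * smoothed[:, t - 1] + s * energy[:, t]
    # Adaptive gain control, then root compression with offset delta.
    return (energy / (eps + smoothed) ** alpha + delta) ** r - delta ** r

# Example: normalise random filterbank energies (40 channels, 100 frames).
rng = np.random.default_rng(0)
features = pcen(rng.random((40, 100)))
```

The gain control and compression in the last line of the function are elementwise and parallelize trivially; the recursive smoother is the part that ties each output frame to all earlier frames.
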
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Dialogue | VoxForge | Accuracy | 91.5 | LEAF |
| Dialogue | VoxForge | Accuracy | 86.6 | EfficientLEAF |
| Dialogue | VoxForge | Accuracy | 85.6 | melspect |
| Audio Classification | Speech Commands | Accuracy | 95.2 | EfficientLEAF |
| Audio Classification | Speech Commands | Accuracy | 95.1 | LEAF |
| Audio Classification | Speech Commands | Accuracy | 95.1 | melspect |
| Audio Classification | CREMA-D | Accuracy | 60.2 | EfficientLEAF |
| Audio Classification | CREMA-D | Accuracy | 58.8 | melspect |
| Audio Classification | CREMA-D | Accuracy | 50.2 | LEAF |
| Audio Classification | BirdCLEF 2021 | Accuracy | 72.2 | EfficientLEAF (8s) |
| Audio Classification | BirdCLEF 2021 | Accuracy | 42.9 | EfficientLEAF |
| Audio Classification | BirdCLEF 2021 | Accuracy | 42.3 | LEAF |
| Audio Classification | BirdCLEF 2021 | Accuracy | 39.9 | melspect |
| Instrument Recognition | NSynth | Accuracy | 72.1 | melspect |
| Instrument Recognition | NSynth | Accuracy | 71.7 | EfficientLEAF |
| Instrument Recognition | NSynth | Accuracy | 69.2 | LEAF |