Avi Gazneli, Gadi Zimerman, Tal Ridnik, Gilad Sharir, Asaf Noy
While efficient architectures and a plethora of augmentations for end-to-end image classification have been proposed and heavily investigated, state-of-the-art techniques for audio classification still rely on numerous representations of the audio signal together with large architectures, fine-tuned from large datasets. By exploiting the inherently lightweight nature of audio together with novel audio augmentations, we present an efficient end-to-end network with strong generalization ability. Experiments on a variety of sound classification datasets demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings. Public code is available at: https://github.com/Alibaba-MIIL/AudioClassfication
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Keyword Spotting | Google Speech Commands | Accuracy (V2, 35 classes) | 98.15 | EAT-S |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 96.3 | EAT-M |
| Audio Classification | ESC-50 | Top-1 Accuracy | 96.3 | EAT-M |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 95.25 | EAT-S |
| Audio Classification | ESC-50 | Top-1 Accuracy | 95.25 | EAT-S |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 92.15 | EAT-S (scratch) |
| Audio Classification | ESC-50 | Top-1 Accuracy | 92.15 | EAT-S (scratch) |
| Audio Classification | AudioSet | Test mAP | 0.426 | EAT-M |
| Audio Classification | AudioSet | Test mAP | 0.405 | EAT-S |