Efficient Training of Audio Transformers with Patchout

Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer

2021-10-11

Audio Classification, Audio Tagging, Instrument Recognition, Acoustic Scene Classification

Abstract

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is their computational complexity: compute and memory requirements grow quadratically with the input length. There has therefore been extensive work on optimizing transformers, but often at the cost of degraded predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on AudioSet and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
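The quadratic growth mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the method named in the title, Patchout, drops a random subset of spectrogram patches during training so that the transformer sees a shorter sequence. The patch grid size, the keep fraction, and the function names below are hypothetical and are not taken from the paper or the PaSST repository.

```python
import random


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by one self-attention head."""
    return num_tokens * num_tokens


def patchout(tokens, keep_fraction, rng):
    """Keep a random subset of patch tokens, preserving their original order."""
    num_keep = max(1, round(len(tokens) * keep_fraction))
    kept_indices = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in kept_indices]


# Hypothetical spectrogram split into a 12 x 100 grid of (frequency, time) patches.
patches = [(f, t) for f in range(12) for t in range(100)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
kept = patchout(patches, keep_fraction=0.5, rng=rng)

full_cost = attention_pairs(len(patches))
reduced_cost = attention_pairs(len(kept))
print(len(patches), len(kept), reduced_cost / full_cost)
```

In this sketch, keeping half of the 1200 patches leaves 600 tokens, so the number of attention pairs falls to one quarter of the original. Because a different random subset is kept at each call, the dropping also acts as a form of input regularization, which is consistent with the abstract's claim of a method that both optimizes and regularizes.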

Results

Task	Dataset	Metric	Value	Model
Audio Classification	FSD50K	mAP	65.55	PaSST-S
Audio Classification	FSD50K	mAP	64.2	PaSST-N-S
Audio Classification	AudioSet	Test mAP	0.496	PaSST (Ensemble)
Audio Classification	AudioSet	Test mAP	0.471	PaSST-S (Single)
Audio Tagging	AudioSet	mean average precision	0.496	PaSST
Instrument Recognition	OpenMIC-2018	mean average precision	0.843	PaSST
