Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Efficient Training of Audio Transformers with Patchout

Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer

2021-10-11
Tasks: Audio Classification, Audio Tagging, Instrument Recognition, Acoustic Scene Classification

Abstract

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is the computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degrading predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
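
To make the quadratic-cost claim in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes that "Patchout" in the title means randomly dropping input patches during training; the abstract itself does not spell out the mechanism. The helper names `attention_pairs` and `drop_patches`, the patch count of 1000, and the keep ratio of 0.5 are all hypothetical choices for illustration.

```python
import random


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one self-attention layer: n * n."""
    return num_tokens * num_tokens


def drop_patches(patches, keep_ratio, rng):
    """Keep a random subset of patches, preserving their original order.

    Hypothetical helper: an assumed reading of 'patchout', not the paper's code.
    """
    num_keep = max(1, round(len(patches) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in kept_idx]


rng = random.Random(0)
patches = list(range(1000))  # stand-in for 1000 spectrogram patch embeddings
kept = drop_patches(patches, keep_ratio=0.5, rng=rng)

full_cost = attention_pairs(len(patches))
reduced_cost = attention_pairs(len(kept))
print(len(kept), reduced_cost / full_cost)
```

Because the cost scales with the square of the sequence length, keeping half of the patches leaves only a quarter of the query-key pairs to compute. That is the intuition behind shortening the input sequence to cut training cost.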

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Audio Classification | FSD50K | mAP | 65.55 | PaSST-S |
| Audio Classification | FSD50K | mAP | 64.2 | PaSST-N-S |
| Audio Classification | AudioSet | Test mAP | 0.496 | PaSST (Ensemble) |
| Audio Classification | AudioSet | Test mAP | 0.471 | PaSST-S (Single) |
| Audio Tagging | AudioSet | mean average precision | 0.496 | PaSST |
| Classification | FSD50K | mAP | 65.55 | PaSST-S |
| Classification | FSD50K | mAP | 64.2 | PaSST-N-S |
| Classification | AudioSet | Test mAP | 0.496 | PaSST (Ensemble) |
| Classification | AudioSet | Test mAP | 0.471 | PaSST-S (Single) |
| Instrument Recognition | OpenMIC-2018 | mean average precision | 0.843 | PaSST |

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4 (2025-06-26)
A Hierarchical Deep Learning Approach for Minority Instrument Detection (2025-06-26)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
USAD: Universal Speech and Audio Representation via Distillation (2025-06-23)
Adaptive Differential Denoising for Respiratory Sounds Classification (2025-06-03)