Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


AST: Audio Spectrogram Transformer

Yuan Gong, Yu-An Chung, James Glass

2021-04-05

Tasks: Keyword Spotting · Audio Classification · Audio Tagging · Speech Emotion Recognition · General Classification · Classification · Time Series Analysis

Abstract

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
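As a rough, illustrative sketch of the convolution-free tokenization the abstract describes: AST splits the input spectrogram into overlapping 16×16 patches (the paper uses a stride of 10, i.e. a 6-frame overlap), flattens each patch, and linearly projects it to the transformer embedding dimension before adding positional embeddings and a classification token. The helper `ast_patchify` and the random projection below are hypothetical stand-ins for exposition, not the authors' code:

```python
import numpy as np

def ast_patchify(spec, patch=16, stride=10):
    """Split a (freq, time) spectrogram into overlapping patch x patch
    tiles and flatten each one into a token vector."""
    n_freq, n_time = spec.shape
    patches = []
    for f in range(0, n_freq - patch + 1, stride):
        for t in range(0, n_time - patch + 1, stride):
            patches.append(spec[f:f + patch, t:t + patch].reshape(-1))
    return np.stack(patches)  # (num_patches, patch * patch)

# Illustrative input: 128 mel bins x 100 frames (~1 s of audio at a 10 ms hop)
spec = np.random.randn(128, 100)
tokens = ast_patchify(spec)          # (108, 256) for this input size

# Linear projection to the transformer embedding dimension (768 in AST);
# a real model would learn W and then prepend a [CLS] token and add
# positional embeddings before the transformer encoder.
W = np.random.randn(256, 768) * 0.02
emb = tokens @ W                     # (108, 768) token embeddings
```

The resulting token sequence is then consumed by a standard transformer encoder, with the output of the classification token used for the final label prediction.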

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Keyword Spotting | Google Speech Commands | Google Speech Commands V2 35 | 98.11 | Audio Spectrogram Transformer |
| Emotion Recognition | CREMA-D | Accuracy | 67.81 | ViT |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 95.7 | Audio Spectrogram Transformer |
| Audio Classification | ESC-50 | Top-1 Accuracy | 95.7 | Audio Spectrogram Transformer |
| Audio Classification | AudioSet | Test mAP | 0.485 | AST (Ensemble) |
| Audio Classification | AudioSet | Test mAP | 0.459 | AST (Single) |
| Audio Tagging | AudioSet | Mean Average Precision | 0.485 | Audio Spectrogram Transformer |
| Time Series Analysis | Speech Commands | % Test Accuracy | 98.11 | ViT |
| Classification | ESC-50 | Accuracy (5-fold) | 95.7 | Audio Spectrogram Transformer |
| Classification | ESC-50 | Top-1 Accuracy | 95.7 | Audio Spectrogram Transformer |
| Classification | AudioSet | Test mAP | 0.485 | AST (Ensemble) |
| Classification | AudioSet | Test mAP | 0.459 | AST (Single) |
| Speech Emotion Recognition | CREMA-D | Accuracy | 67.81 | ViT |

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Emergence of Functionally Differentiated Structures via Mutual Information Optimization in Recurrent Neural Networks (2025-07-17)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
- Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
- AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
- Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)