TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Multiscale Audio Spectrogram Transformer for Efficient Aud...

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Wentao Zhu, Mohamed Omar

2023-03-19Representation LearningAudio ClassificationClassification
PaperPDF

Abstract

Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency domains) in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST~\cite{gong2021ast} by 22.2\%, 4.4\% and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of the top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20\% missing audios, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs) with 42\% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.

Results

TaskDatasetMetricValueModel
Audio ClassificationVGGSoundTop 1 Accuracy57MAST (Audio Only)
Audio ClassificationVGGSoundTop 5 Accuracy81.3MAST (Audio Only)
ClassificationVGGSoundTop 1 Accuracy57MAST (Audio Only)
ClassificationVGGSoundTop 5 Accuracy81.3MAST (Audio Only)

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper2025-07-20Spectral Bellman Method: Unifying Representation and Exploration in RL2025-07-17Boosting Team Modeling through Tempo-Relational Representation Learning2025-07-17Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17MUPAX: Multidimensional Problem Agnostic eXplainable AI2025-07-17Adversarial attacks to image classification systems using evolutionary algorithms2025-07-17Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization?2025-07-16