Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

2022-02-02
Keyword Spotting, Sound Classification, Audio Classification, Sound Event Detection, Event Detection

Abstract

Audio classification is an important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, and they rely on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on Speech Command V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
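
The abstract describes class feature maps that support both clip-level classification and localization in time. The sketch below illustrates that general idea only, and is not the authors' implementation: it assumes a class feature map is already available as a list of per-frame class scores, averages it over time for a clip-level score, and thresholds it to find active frame spans. The function names, the 0.5 threshold, and the toy values are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical helper: a "class map" here is a list of frames,
# each frame holding one score per class (values in [0, 1]).
def clip_scores(class_map: List[List[float]]) -> List[float]:
    """Average per-frame class scores over time: one score per class."""
    n_frames = len(class_map)
    n_classes = len(class_map[0])
    return [
        sum(frame[c] for frame in class_map) / n_frames
        for c in range(n_classes)
    ]

# Hypothetical helper: the threshold value is an illustrative assumption.
def localize(
    class_map: List[List[float]], class_idx: int, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return [start, end) frame spans where the class score >= threshold."""
    events = []
    start = None
    for t, frame in enumerate(class_map):
        active = frame[class_idx] >= threshold
        if active and start is None:
            start = t                      # an event begins
        elif not active and start is not None:
            events.append((start, t))      # the event ended at frame t
            start = None
    if start is not None:                  # event runs to the end of the clip
        events.append((start, len(class_map)))
    return events

# Toy map: 4 frames, 2 classes (made-up numbers, not model output).
toy_map = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.2, 0.9],
]
```

With this toy input, class 0 is active in the first two frames and class 1 in the last two, while the clip-level scores summarize each class over the whole clip.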

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Keyword Spotting | Google Speech Commands | Google Speech Commands V2 35 | 98 | HTS-AT |
| Audio Classification | ESC-50 | Accuracy (5-fold) | 97 | HTS-AT |
| Audio Classification | ESC-50 | Top-1 Accuracy | 97 | HTS-AT |
| Audio Classification | AudioSet | Test mAP | 0.487 | HTS-AT (Ensemble) |
| Sound Event Detection | DESED | event-based F1 score | 50.7 | HTS-AT |
| Classification | ESC-50 | Accuracy (5-fold) | 97 | HTS-AT |
| Classification | ESC-50 | Top-1 Accuracy | 97 | HTS-AT |
| Classification | AudioSet | Test mAP | 0.487 | HTS-AT (Ensemble) |
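
For readers unfamiliar with the mAP metric reported for AudioSet, the following is a minimal sketch of mean average precision for multi-label classification: average precision is computed per class from ranked scores, then averaged across classes. It uses the non-interpolated definition and skips classes with no positive examples; both are assumptions, and evaluation scripts may differ in tie handling and interpolation. The function names and toy data are hypothetical and unrelated to the values listed above.

```python
from typing import List

def average_precision(labels: List[int], scores: List[float]) -> float:
    """Non-interpolated AP for one class: mean precision at each positive."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank   # precision at this positive
    return precision_sum / hits if hits else 0.0

def mean_average_precision(
    label_matrix: List[List[int]], score_matrix: List[List[float]]
) -> float:
    """Mean of per-class AP; rows are examples, columns are classes."""
    n_classes = len(label_matrix[0])
    aps = []
    for c in range(n_classes):
        labels = [row[c] for row in label_matrix]
        scores = [row[c] for row in score_matrix]
        if any(labels):                    # assumption: skip classes with no positives
            aps.append(average_precision(labels, scores))
    return sum(aps) / len(aps) if aps else 0.0

# Toy data: 4 examples, 2 classes (made-up values).
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
toy_scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.1, 0.8]]
toy_map_value = mean_average_precision(toy_labels, toy_scores)
```

In the toy data, class 1 is ranked perfectly while class 0 has one negative ranked above a positive, so the mean falls below 1.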

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
USAD: Universal Speech and Audio Representation via Distillation (2025-06-23)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)
Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models (2025-06-21)
Low-resource keyword spotting using contrastively trained transformer acoustic word embeddings (2025-06-21)
ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors (2025-06-17)