Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SepTr: Separable Transformer for Audio Spectrogram Processing

Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan

2022-03-17 · Audio Classification · Speech Emotion Recognition · Time Series Analysis

Abstract

Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same time interval, and the second attending to tokens within the same frequency bin. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
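
The two-stage attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see their repository for that). It assumes tokens arranged as a (time, frequency, embedding) array, uses single-head attention with no learned projections, and adds a residual connection after each stage. The function names `self_attention`, `separable_attention`, and `attention_pairs` are hypothetical, and the class token and MLP layers of a full transformer block are omitted.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention over the second-to-last axis.

    Simplification (assumption): queries, keys and values are the input itself,
    i.e. there are no learned projection matrices.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def separable_attention(tokens):
    """Apply attention along frequency, then along time.

    tokens: array of shape (T, F, D) holding one D-dimensional token
    per (time interval, frequency bin) patch.
    """
    # Stage 1: tokens in the same time interval attend to each other.
    # The leading axis T acts as a batch; each sequence has F tokens.
    x = tokens + self_attention(tokens)

    # Stage 2: tokens in the same frequency bin attend to each other.
    # Swap axes so F acts as the batch and each sequence has T tokens.
    x = np.swapaxes(x, 0, 1)
    x = x + self_attention(x)
    return np.swapaxes(x, 0, 1)


def attention_pairs(num_time, num_freq):
    """Count query-key scores for joint versus separable attention."""
    joint = (num_time * num_freq) ** 2
    separable = num_time * num_freq**2 + num_freq * num_time**2
    return joint, separable
```

For example, `separable_attention(np.random.default_rng(0).standard_normal((4, 6, 8)))` returns an array of the same shape, (4, 6, 8). The `attention_pairs` helper only counts attention scores under these assumptions; it says nothing about the trainable-parameter scaling claimed in the abstract.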

Results

Task                        Dataset          Metric           Value  Model
Emotion Recognition         CREMA-D          Accuracy         70.47  SepTr
Audio Classification        ESC-50           Top-1 Accuracy   91.13  SepTr
Time Series Analysis        Speech Commands  % Test Accuracy  98.51  SepTr
Classification              ESC-50           Top-1 Accuracy   91.13  SepTr
Speech Emotion Recognition  CREMA-D          Accuracy         70.47  SepTr

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Emergence of Functionally Differentiated Structures via Mutual Information Optimization in Recurrent Neural Networks (2025-07-17)
Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation (2025-07-11)
Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems (2025-06-26)
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons (2025-06-24)
MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition (2025-06-24)
Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier (2025-06-23)