SepTr: Separable Transformer for Audio Spectrogram Processing

Nicolae-Catalin Ristea, Radu Tudor Ionescu, Fahad Shahbaz Khan

2022-03-17 | Audio Classification, Speech Emotion Recognition, Time Series Analysis

Abstract

Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same time interval, and the second attending to tokens within the same frequency bin. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
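
The two-stage attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes the spectrogram has already been split into a grid of patch embeddings of shape (n_freq, n_time, dim), and it uses single-head attention with no learned projections, class tokens, normalization, or feed-forward layers. The function names `self_attention`, `attend_within_time`, `attend_within_frequency`, and `separable_attention` are hypothetical and chosen only for this example.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x has shape (..., n_tokens, dim). Queries, keys, and values are all x
    itself (no learned projections), a simplification for illustration.
    """
    dim = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(dim)    # (..., n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                                     # (..., n, dim)


def attend_within_time(tokens):
    """First stage: tokens attend only to tokens in the same time interval.

    tokens has shape (n_freq, n_time, dim). Each time step is treated as an
    independent sequence whose elements run along the frequency axis.
    """
    by_time = np.swapaxes(tokens, 0, 1)           # (n_time, n_freq, dim)
    by_time = by_time + self_attention(by_time)   # residual connection
    return np.swapaxes(by_time, 0, 1)             # (n_freq, n_time, dim)


def attend_within_frequency(tokens):
    """Second stage: tokens attend only to tokens in the same frequency bin.

    Each frequency bin is an independent sequence running along time.
    """
    return tokens + self_attention(tokens)        # (n_freq, n_time, dim)


def separable_attention(tokens):
    """Apply the two stages sequentially, in the order given in the abstract."""
    return attend_within_frequency(attend_within_time(tokens))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = rng.normal(size=(4, 6, 8))  # 4 frequency bins, 6 time steps, dim 8
    print(separable_attention(grid).shape)
```

In this sketch, each stage computes attention only among tokens that share one coordinate of the grid, rather than among all n_freq × n_time tokens at once as a conventional vision transformer would.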

Results

Task	Dataset	Metric	Value	Model
Emotion Recognition	CREMA-D	Accuracy	70.47	SepTr
Audio Classification	ESC-50	Top-1 Accuracy	91.13	SepTr
Time Series Analysis	Speech Commands	% Test Accuracy	98.51	SepTr
Classification	ESC-50	Top-1 Accuracy	91.13	SepTr
Speech Emotion Recognition	CREMA-D	Accuracy	70.47	SepTr
