Mixer is more than just a model

Qingfeng Ji, Yuxin Wang, Letong Sun

2024-02-28Environmental Sound Classification Audio Classification Speech Emotion Recognition

Abstract

Recently, MLP structures have regained popularity, with MLP-Mixer standing out as a prominent example. In the field of computer vision, MLP-Mixer is noted for its ability to extract data information from both channel and token perspectives, effectively acting as a fusion of channel and token information. Indeed, Mixer represents a paradigm for information extraction that amalgamates channel and token information. The essence of Mixer lies in its ability to blend information from diverse perspectives, epitomizing the true concept of "mixing" in the realm of neural network architectures. Beyond channel and token considerations, it is possible to create more tailored mixers from various perspectives to better suit specific task requirements. This study focuses on the domain of audio recognition, introducing a novel model named Audio Spectrogram Mixer with Roll-Time and Hermit FFT (ASM-RH) that incorporates insights from both time and frequency domains. Experimental results demonstrate that ASM-RH is particularly well-suited for audio data and yields promising outcomes across multiple classification tasks. The models and optimal weights files will be published.

Results

Task	Dataset	Metric	Value	Model
Audio Classification	Speech Commands	Accuracy	96.51	ASM-RH
Audio Classification	RAVDESS	Top-1 Accuracy	75.4	ASM-RH-A
Classification	Speech Commands	Accuracy	96.51	ASM-RH
Classification	RAVDESS	Top-1 Accuracy	75.4	ASM-RH-A

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17 MUPAX: Multidimensional Problem Agnostic eXplainable AI2025-07-17 Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation2025-07-11 Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons2025-06-24 MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition2025-06-24 Fully Few-shot Class-incremental Audio Classification Using Multi-level Embedding Extractor and Ridge Regression Classifier2025-06-23 Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction2025-06-12 MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions2025-06-11