Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions

Yan Ru Pei

2025-01-22Speech Recognition Keyword Spotting Denoising Automatic Speech Recognition Automatic Speech Recognition (ASR)speech-recognition Speech Enhancement Speech Denoising

Paper PDF

Abstract

We introduce Centaurus, a class of networks composed of generalized state-space model (SSM) blocks, where the SSM operations can be treated as tensor contractions during training. The optimal order of tensor contractions can then be systematically determined for every SSM block to maximize training efficiency. This allows more flexibility in designing SSM blocks beyond the depthwise-separable configuration commonly implemented. The new design choices will take inspiration from classical convolutional blocks including group convolutions, full convolutions, and bottleneck blocks. We architect the Centaurus network with a mixture of these blocks, to balance between network size and performance, as well as memory and computational efficiency during both training and inference. We show that this heterogeneous network design outperforms its homogeneous counterparts in raw audio processing tasks including keyword spotting, speech denoising, and automatic speech recognition (ASR). For ASR, Centaurus is the first network with competitive performance that can be made fully state-space based, without using any nonlinear recurrence (LSTMs), explicit convolutions (CNNs), or (surrogate) attention mechanism. Source code is available at github.com/Brainchip-Inc/Centaurus

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	Speech Commands	Accuracy (%)	98.53	Centaurus
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	4.4	Centaurus (30 M)
Speech Enhancement	VoiceBank + DEMAND	PESQ (wb)	3.25	Centaurus (0.51M)

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17 NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech2025-07-17 fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting2025-07-17 Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models2025-07-17 Autoregressive Speech Enhancement via Acoustic Tokens2025-07-17 Similarity-Guided Diffusion for Contrastive Sequential Recommendation2025-07-16 HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing2025-07-15 AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air2025-07-15