MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen

2022-11-14

Speech Recognition, Automatic Speech Recognition, Representation Learning, Self-Supervised Learning, Multi-Task Learning


Abstract

In this paper, we offer a new perspective on self-supervised speech models, based on how the training targets are obtained. We generalize the target extractor into two types: the Offline Targets Extractor (Off-TE) and the Online Targets Extractor (On-TE). Building on this, we propose a new multi-task learning framework for self-supervised learning, MT4SSL, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models while using less data. Furthermore, we find that using both Off-TE and On-TE results in better convergence in the pre-training phase. Given both its effectiveness and efficiency, we believe that multi-task learning on self-supervised speech models from this perspective is a promising direction.
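The idea of combining the two kinds of targets can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract only states that K-means serves as the Off-TE and a gradient-free teacher network as the On-TE; everything else here is an assumption made for illustration: the function names, the exponential-moving-average teacher update, the cross-entropy and mean-squared-error losses, the masking, and the `weight` parameter.

```python
import numpy as np

def kmeans_assign(features, centroids):
    """Off-TE sketch: label each frame with the index of its nearest centroid."""
    # Squared Euclidean distances, shape (frames, clusters).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def ema_update(teacher, student, decay=0.999):
    """On-TE sketch (assumed EMA rule): teacher weights track the student.

    No gradient flows into the teacher; it is only updated by this rule.
    """
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def cross_entropy(logits, labels):
    """Mean cross-entropy between frame logits and discrete labels."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def multi_target_loss(logits, offline_labels, student_out, teacher_out,
                      mask, weight=1.0):
    """Sum of an offline (discrete) loss and an online (continuous) loss.

    Both terms are computed on masked frames only (an assumption).
    """
    offline = cross_entropy(logits[mask], offline_labels[mask])
    online = ((student_out[mask] - teacher_out[mask]) ** 2).mean()
    return offline + weight * online

# Toy usage with random arrays standing in for real model outputs.
rng = np.random.default_rng(0)
frames, dim, clusters = 6, 4, 3
features = rng.normal(size=(frames, dim))
centroids = rng.normal(size=(clusters, dim))
offline_labels = kmeans_assign(features, centroids)
mask = np.array([True, False, True, True, False, True])
loss = multi_target_loss(
    logits=rng.normal(size=(frames, clusters)),
    offline_labels=offline_labels,
    student_out=rng.normal(size=(frames, dim)),
    teacher_out=rng.normal(size=(frames, dim)),
    mask=mask,
)
print(float(loss))
```

In this sketch the offline targets are fixed before training, while the online targets change as the teacher follows the student, which is the distinction the Off-TE/On-TE terminology draws.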

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	3.4	MT4SSL
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	9.6	MT4SSL
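
For readers unfamiliar with the metric in the table, the sketch below computes word error rate under its standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. It is not the scoring script used for the reported numbers. The function name `word_error_rate` is hypothetical, and whitespace tokenization with no text normalization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent for a single reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Benchmark figures are normally computed over a whole test set by summing errors and reference words across utterances rather than averaging per-utterance rates.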
