BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu

2021-09-27

Speech Recognition, Automatic Speech Recognition (ASR), Language Identification, Speech Emotion Recognition

Abstract

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	AMI IMH	Word Error Rate (WER)	7.8	ConformerXXL-P + Downstream NST
Speech Recognition	CHiME-6 eval	Word Error Rate (WER)	31	ConformerXXL-PS
Speech Recognition	WSJ eval92	Word Error Rate (WER)	1.3	ConformerXXL-P
Speech Recognition	AMI SDM1	Word Error Rate (WER)	17.7	ConformerXXL-P
Speech Recognition	CHiME-6 dev_gss12	Word Error Rate (WER)	26.2	ConformerXXL-PS
Speech Recognition	TED-LIUM	Word Error Rate (WER)	5	ConformerXXL-PS
Emotion Recognition	CREMA-D	Accuracy	88.2	ConformerXL-P
Language Identification	VoxForge	Accuracy	99.8	ConformerG-P
Speech Emotion Recognition	CREMA-D	Accuracy	88.2	ConformerXL-P
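
For readers unfamiliar with the metric in the table above, the following is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It assumes the table's WER values are percentages and uses plain whitespace tokenization; the function name `word_error_rate` is illustrative, and the text normalization and scoring tools used for the reported numbers are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage: 100 * word edit distance / reference length.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors but do not lengthen the reference, WER can exceed 100 when the hypothesis is much longer than the reference.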
