SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

2021-04-05
Speech Recognition, Transfer Learning, Language Modelling

Abstract

We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves state-of-the-art (SoTA) or near-SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperform prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy, low-resource speech dataset, CHiME-6, and achieve 38.9% WER without a language model, compared with 38.6% WER for a strong HMM baseline with a language model.
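
As a rough illustration of what "simply mixing" without re-weighting implies, the sketch below pools every utterance from every corpus and samples uniformly from the pool, so each corpus contributes in proportion to its size. This is a minimal sketch under stated assumptions, not the authors' data pipeline: the function names, the corpus names `small_corpus` and `large_corpus`, and the utterance IDs are all hypothetical placeholders.

```python
import random
from typing import Dict, List, Tuple

Utterance = Tuple[str, str]  # (corpus name, utterance id)


def mix_datasets(datasets: Dict[str, List[str]]) -> List[Utterance]:
    """Concatenate all corpora into one pool, tagging each utterance with its source."""
    return [(name, utt) for name, utts in datasets.items() for utt in utts]


def sample_batch(pool: List[Utterance], batch_size: int, rng: random.Random) -> List[Utterance]:
    """Draw uniformly over utterances; no per-corpus weights are applied."""
    return [pool[rng.randrange(len(pool))] for _ in range(batch_size)]


def expected_share(datasets: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of samples each corpus receives under uniform sampling of the pool."""
    total = sum(len(utts) for utts in datasets.values())
    return {name: len(utts) / total for name, utts in datasets.items()}


# Hypothetical toy corpora; the sizes are placeholders, not real dataset sizes.
toy = {
    "small_corpus": ["s0", "s1"],
    "large_corpus": ["l0", "l1", "l2", "l3", "l4", "l5"],
}
pool = mix_datasets(toy)
batch = sample_batch(pool, batch_size=4, rng=random.Random(0))
shares = expected_share(toy)
```

With this scheme the larger corpus dominates each batch by construction. Re-balancing would mean replacing the uniform draw with per-corpus weights, which is the step the abstract says SpeechStew omits.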

Results

Task	Dataset	Metric	Value (%)	Model
Speech Recognition	Switchboard CallHome	Word Error Rate (WER)	8.3	SpeechStew (100M)
Speech Recognition	AMI IHM	Word Error Rate (WER)	9.0	SpeechStew (100M)
Speech Recognition	Tedlium	Word Error Rate (WER)	5.3	SpeechStew (100M)
Speech Recognition	CHiME-6 eval	Word Error Rate (WER)	38.9	SpeechStew (1B)
Speech Recognition	WSJ eval92	Word Error Rate (WER)	1.3	SpeechStew (100M)
Speech Recognition	Switchboard SWBD	Word Error Rate (WER)	4.7	SpeechStew (100M)
Speech Recognition	AMI SDM1	Word Error Rate (WER)	21.7	SpeechStew (100M)
Speech Recognition	CHiME-6 dev_gss12	Word Error Rate (WER)	31.9	SpeechStew (1B)
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	1.7	SpeechStew (1B)
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	2.0	SpeechStew (100M)
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	3.3	SpeechStew (1B)
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	4.0	SpeechStew (100M)
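
Every value in the table is a word error rate, conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The sketch below shows that standard definition for a single utterance. It assumes whitespace tokenization and no text normalization; the scoring tools and normalization rules behind the reported numbers are not specified here, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c d")` gives 25.0, since one of four reference words is substituted. Corpus-level WER is usually computed by summing edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.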
