Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network

William Chan, Daniel Park, Chris Lee, Yu Zhang, Quoc Le, Mohammad Norouzi

Published: 2021-04-05
Tasks: Speech Recognition, Transfer Learning, Language Modelling

Abstract

We present SpeechStew, a speech recognition model trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal. SpeechStew simply mixes all of these datasets together, without any special re-weighting or re-balancing of the datasets. SpeechStew achieves SoTA or near-SoTA results across a variety of tasks, without the use of an external language model. Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ, which significantly outperforms prior work with strong external language models. We also demonstrate that SpeechStew learns powerful transfer learning representations. We fine-tune SpeechStew on a noisy low-resource speech dataset, CHiME-6. We achieve 38.9% WER without a language model, compared to 38.6% WER for a strong HMM baseline with a language model.
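The "simply mix" recipe amounts to concatenating all corpora and shuffling, with no re-weighting or re-balancing, so each corpus contributes in proportion to its size. A minimal sketch with hypothetical toy utterance lists standing in for the corpora (illustrative only, not the authors' training pipeline):

```python
import random

# Hypothetical utterance lists standing in for three of the seven corpora;
# the (corpus, utterance_id) tuples are placeholders, not real data.
ami = [("ami", f"utt{i}") for i in range(3)]
librispeech = [("librispeech", f"utt{i}") for i in range(5)]
wsj = [("wsj", f"utt{i}") for i in range(2)]

# "Simply mix": concatenate and shuffle. No per-corpus sampling weights,
# so larger corpora are naturally seen more often during training.
mixed = ami + librispeech + wsj
random.shuffle(mixed)
```

Because there is no re-balancing step, the probability of drawing an utterance from a given corpus equals that corpus's share of the total utterance count.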

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Speech Recognition | Switchboard CallHome | Word Error Rate (WER) | 8.3 | SpeechStew (100M) |
| Speech Recognition | AMI IHM | Word Error Rate (WER) | 9.0 | SpeechStew (100M) |
| Speech Recognition | Tedlium | Word Error Rate (WER) | 5.3 | SpeechStew (100M) |
| Speech Recognition | CHiME-6 eval | Word Error Rate (WER) | 38.9 | SpeechStew (1B) |
| Speech Recognition | WSJ eval92 | Word Error Rate (WER) | 1.3 | SpeechStew (100M) |
| Speech Recognition | Switchboard SWBD | Word Error Rate (WER) | 4.7 | SpeechStew (100M) |
| Speech Recognition | AMI SDM1 | Word Error Rate (WER) | 21.7 | SpeechStew (100M) |
| Speech Recognition | CHiME-6 dev_gss12 | Word Error Rate (WER) | 31.9 | SpeechStew (1B) |
| Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 1.7 | SpeechStew (1B) |
| Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 2.0 | SpeechStew (100M) |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 3.3 | SpeechStew (1B) |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 4.0 | SpeechStew (100M) |
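All results above are reported as word error rate: the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. A minimal reference implementation via dynamic programming (illustrative; real evaluations typically normalize text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference yields a WER of 1/3, reported above as a percentage (e.g. 9.0 means 9.0%).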

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)