TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/BigSSL: Exploring the Frontier of Large-Scale Semi-Supervi...

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu

2021-09-27Speech RecognitionAutomatic Speech RecognitionLanguage IdentificationAutomatic Speech Recognition (ASR)speech-recognitionSpeech Emotion Recognition
PaperPDF

Abstract

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

Results

TaskDatasetMetricValueModel
Speech RecognitionAMI IMHWord Error Rate (WER)7.8ConformerXXL-P + Downstream NST
Speech RecognitionCHiME-6 evalWord Error Rate (WER)31ConformerXXL-PS
Speech RecognitionWSJ eval92Word Error Rate (WER)1.3ConformerXXL-P
Speech RecognitionAMI SDM1Word Error Rate (WER)17.7ConformerXXL-P
Speech RecognitionCHiME-6 dev_gss12Word Error Rate (WER)26.2ConformerXXL-PS
Speech RecognitionTED-LIUMWord Error Rate (WER)5ConformerXXL-PS
Emotion RecognitionCREMA-DAccuracy88.2ConformerXL-P
Language IdentificationVoxForgeAccuracy99.8ConformerG-P
Speech Emotion RecognitionCREMA-DAccuracy88.2ConformerXL-P

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine2025-07-17NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech2025-07-17WhisperKit: On-device Real-time ASR with Billion-Scale Transformers2025-07-14Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation2025-07-11VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis2025-07-08A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting2025-07-06First Steps Towards Voice Anonymization for Code-Switching Speech2025-07-02MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement2025-07-01