Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

Published: 2022-12-06 (Preprint 2022)
Tasks: Speech Recognition, Speech-to-Speech Translation, Robust Speech Recognition, Zero-Shot Audio Retrieval


Abstract

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Results

Task	Dataset	Metric	Value	Model
Speech-to-Speech Translation	FLEURS X-eng	ASR-BLEU	23.5	WhisperV2
