wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli

2020-06-20 | NeurIPS 2020 12 | Speech Recognition, Quantization, Self-Supervised Learning, Zero-Shot Audio Retrieval

Abstract

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
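
The abstract describes a contrastive task over quantized latent representations but does not give its formula. As a rough illustration only, the sketch below uses a common InfoNCE-style form: given the context vector at a masked time step, the model must pick the true quantized target out of a set of distractors. The cosine similarity, the temperature value, and the names `cosine_similarity` and `contrastive_loss` are assumptions made for this example, not details taken from the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step (illustrative assumption).

    context     -- context vector at the masked position
    target      -- true quantized latent for that position
    distractors -- quantized latents sampled from other positions
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    # Negative log-probability assigned to the true target (index 0).
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    context = [1.0, 0.0]
    easy = contrastive_loss(context, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    hard = contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print(easy < hard)
```

The loss is low when the context vector is closer to the true target than to any distractor, and it grows when a distractor is the better match.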

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	Libri-Light test-other	Word Error Rate (WER)	5	wav2vec 2.0 Large-10h-LV-60k
Speech Recognition	TIMIT	Phoneme Error Rate (PER)	8.3	wav2vec 2.0
Speech Recognition	Libri-Light test-clean	Word Error Rate (WER)	2.5	wav2vec 2.0 Large-10h-LV-60k
Speech Recognition	LibriSpeech test-clean	Word Error Rate (WER)	1.8	wav2vec 2.0 with Libri-Light
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	3	wav2vec 2.0 with Libri-Light
Speech Recognition	LibriSpeech test-other	Word Error Rate (WER)	4.1	wav2vec 2.0
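
Most of the rows above report Word Error Rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words and expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions for this example; published benchmark scores are normally computed with established scoring tools and text normalization that this sketch does not reproduce.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent for two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.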
