wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli

2020-06-20 | NeurIPS 2020 12 | Speech Recognition, Quantization, Self-Supervised Learning, Zero-Shot Audio Retrieval

Abstract

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
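
The contrastive task mentioned in the abstract can be illustrated with a small sketch. It assumes an InfoNCE-style objective: the context vector at a masked position should be more similar to the true quantized latent than to a set of distractors. The cosine similarity, the temperature value, and the names `cosine` and `contrastive_loss` are illustrative assumptions, not details given on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among all candidates.

    `context` stands in for the model output at a masked time step,
    `target` for the true quantized latent, and `distractors` for
    quantized latents sampled from other positions (all hypothetical inputs).
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]
```

The loss is close to zero when the context vector aligns with the true target and is far from every distractor, and it grows when a distractor is a better match than the target.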

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | Libri-Light test-other | Word Error Rate (WER) | 5 | wav2vec 2.0 Large-10h-LV-60k |
| Speech Recognition | TIMIT | Percentage error | 8.3 | wav2vec 2.0 |
| Speech Recognition | Libri-Light test-clean | Word Error Rate (WER) | 2.5 | wav2vec 2.0 Large-10h-LV-60k |
| Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 1.8 | wav2vec 2.0 with Libri-Light |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 3 | wav2vec 2.0 with Libri-Light |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 4.1 | wav2vec 2.0 |
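
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below assumes simple whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical, and the evaluation scripts behind the numbers above may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. Multiplying by 100 gives a percentage, which appears to be the scale used for the values listed here.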

Related Papers

Efficient Deployment of Spiking Neural Networks on SpiNNaker2 for DVS Gesture Recognition Using Neuromorphic Intermediate Representation (2025-09-04)
An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC (2025-07-18)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Angle Estimation of a Single Source with Massive Uniform Circular Arrays (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
Quantized Rank Reduction: A Communications-Efficient Federated Learning Scheme for Network-Critical Applications (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)