vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

Alexei Baevski, Steffen Schneider, Michael Auli

2019-10-12 · ICLR 2020 · Tasks: Speech Recognition, Self-Supervised Learning, Clustering, General Classification

Abstract

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community that require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
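
To make the quantization step concrete, here is a minimal sketch of the two selection rules the abstract names, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy codebook, and the toy frames are all hypothetical, and everything needed for training (learned encoders, gradient estimation, codebook updates) is omitted.

```python
import math
import random


def nearest_codeword(z, codebook):
    """k-means-style assignment: index of the codebook vector closest to z
    under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sq_dist(z, codebook[j]))


def gumbel_argmax(logits, rng):
    """Gumbel-style selection: add Gumbel(0, 1) noise to each logit and
    return the index of the largest noisy logit."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 so that log(u) is defined.
        u = max(rng.random(), 1e-12)
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(noisy)), key=lambda j: noisy[j])


# Hypothetical toy data: three 2-D codewords and four dense "frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.6, 0.7]]

# Each dense frame becomes one discrete index into the codebook.
indices = [nearest_codeword(z, codebook) for z in frames]
print(indices)

# With one dominant logit, the Gumbel noise does not change the choice.
rng = random.Random(0)
print(gumbel_argmax([0.0, 100.0, 0.0], rng))
```

The resulting index sequence plays the role of a token sequence, which is what lets models that expect discrete inputs, such as BERT, be applied to audio.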

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	TIMIT	Percentage error	11.6	vq-wav2vec
