Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le

2019-04-18
Speech Recognition | Automatic Speech Recognition | Automatic Speech Recognition (ASR) | Data Augmentation | Language Modelling

Abstract

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
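
The two masking steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spec_augment_mask`, its parameter names, the default widths, and the choice of 0.0 as the fill value are all assumptions made for illustration, and the warping step is omitted entirely. Features are assumed to be a list of time steps, each a list of frequency-channel values.

```python
import random


def spec_augment_mask(features, num_freq_masks=1, max_freq_width=8,
                      num_time_masks=1, max_time_width=20, rng=None):
    """Return a copy of `features` with random frequency and time blocks masked.

    `features` is a list of time steps, each a list of channel values.
    All defaults are illustrative assumptions, not values from the source.
    """
    rng = rng or random.Random()
    out = [list(row) for row in features]  # copy so the input is untouched
    num_steps = len(out)
    num_channels = len(out[0]) if out else 0

    # Frequency masking: zero a contiguous block of channels in every time step.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_channels))
        start = rng.randint(0, num_channels - width)
        for row in out:
            for c in range(start, start + width):
                row[c] = 0.0

    # Time masking: zero a contiguous block of time steps across all channels.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_steps))
        start = rng.randint(0, num_steps - width)
        for t in range(start, start + width):
            out[t] = [0.0] * num_channels

    return out
```

Calling `spec_augment_mask(features, rng=random.Random(0))` gives a reproducible masked copy. Plain lists keep the sketch dependency-free; a training pipeline would more likely apply the same slicing to arrays or tensors on the fly.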

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | Hub5'00 SwitchBoard | CallHome | 14.6 | LAS + SpecAugment (with LM, Switchboard mild policy)
Speech Recognition | Hub5'00 SwitchBoard | SwitchBoard | 6.8 | LAS + SpecAugment (with LM, Switchboard mild policy)
Speech Recognition | Hub5'00 SwitchBoard | CallHome | 14 | LAS + SpecAugment (with LM, Switchboard strong policy)
Speech Recognition | Hub5'00 SwitchBoard | SwitchBoard | 7.1 | LAS + SpecAugment (with LM, Switchboard strong policy)
Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 2.5 | LAS + SpecAugment
Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 2.7 | LAS (no LM)
Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 5.8 | LAS + SpecAugment
Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 6.5 | LAS (no LM)
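
For readers unfamiliar with the Word Error Rate (WER) metric reported above, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; benchmark scoring tools typically apply additional text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the values in the table. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.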

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)