Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CR-CTC: Consistency regularization on CTC for improved speech recognition

Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

2024-10-07 · Speech Recognition · Automatic Speech Recognition (ASR) · speech-recognition
Paper · PDF · Code (official)

Abstract

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
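The core idea — penalizing disagreement between the per-frame CTC posteriors produced from two augmented views of the same utterance — can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation (that lives in the icefall repository linked above): the function names are hypothetical, a symmetric KL divergence is assumed as the consistency measure, and the full CR-CTC objective additionally includes the CTC losses computed on both views.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two per-frame distributions
    over the output vocabulary (last axis). eps guards against log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return 0.5 * (np.sum(p * np.log(p / q), axis=-1)
                  + np.sum(q * np.log(q / p), axis=-1))

def softmax(x):
    """Numerically stable softmax over the vocabulary axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cr_ctc_consistency_loss(logits_a, logits_b):
    """Consistency term: mean symmetric KL between the frame-level
    CTC posteriors of the two augmented views.
    logits_*: (batch, frames, vocab) encoder outputs before softmax."""
    return float(np.mean(symmetric_kl(softmax(logits_a),
                                      softmax(logits_b))))
```

In training, this term would be added (with a weighting factor) to the CTC losses of both views; identical logits yield a consistency loss of zero, and the penalty grows as the two distributions diverge.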

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | GigaSpeech DEV | Word Error Rate (WER) | 9.95 | Zipformer+pruned transducer w/ CR-CTC (no external language model) |
| Speech Recognition | GigaSpeech DEV | Word Error Rate (WER) | 10.09 | Zipformer+pruned transducer (no external language model) |
| Speech Recognition | GigaSpeech DEV | Word Error Rate (WER) | 10.15 | Zipformer+CR-CTC (no external language model) |
| Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 1.88 | Zipformer+pruned transducer w/ CR-CTC (no external language model) |
| Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 2.02 | Zipformer+CR-CTC (no external language model) |
| Speech Recognition | GigaSpeech TEST | Word Error Rate (WER) | 10.03 | Zipformer+pruned transducer w/ CR-CTC (no external language model) |
| Speech Recognition | GigaSpeech TEST | Word Error Rate (WER) | 10.07 | Zipformer+CR-CTC/AED (no external language model) |
| Speech Recognition | GigaSpeech TEST | Word Error Rate (WER) | 10.2 | Zipformer+pruned transducer (no external language model) |
| Speech Recognition | GigaSpeech TEST | Word Error Rate (WER) | 10.28 | Zipformer+CR-CTC (no external language model) |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 3.95 | Zipformer+pruned transducer w/ CR-CTC (no external language model) |
| Speech Recognition | LibriSpeech test-other | Word Error Rate (WER) | 4.35 | Zipformer+CR-CTC (no external language model) |
| Speech Recognition | AISHELL-1 | Params (M) | 66.2 | Zipformer+CR-CTC (no external language model) |
| Speech Recognition | AISHELL-1 | Word Error Rate (WER) | 4.02 | Zipformer+CR-CTC (no external language model) |

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)