Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


The PyTorch-Kaldi Speech Recognition Toolkit

Mirco Ravanelli, Titouan Parcollet, Yoshua Bengio

Published: 2018-11-19 · Tasks: Speech Recognition, Distant Speech Recognition
Links: Paper · PDF · Code (official and community implementations)

Abstract

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not merely a simple interface between the two: it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed so that user-defined acoustic models can be plugged in naturally. Alternatively, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.
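Several of the results below use the authors' light GRU (Li-GRU), one of the pre-implemented recurrent models the abstract refers to. As a rough illustration of its update rule (taken from the authors' earlier Li-GRU work, not defined on this page): the reset gate is removed and the candidate state uses a ReLU instead of tanh. The scalar weights below are stand-ins for the real weight matrices, and batch normalization on the feed-forward terms is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ligru_step(x, h_prev, wz, uz, wh, uh):
    """One scalar Li-GRU step (illustrative; real cells use matrices)."""
    # Update gate; unlike a standard GRU, there is no reset gate
    z = sigmoid(wz * x + uz * h_prev)
    # Candidate state with ReLU activation (standard GRU uses tanh)
    h_cand = max(0.0, wh * x + uh * h_prev)
    # Interpolate between the previous state and the candidate
    return z * h_prev + (1.0 - z) * h_cand

# Run a tiny sequence through the cell with arbitrary example weights
h = 0.0
for x in [1.0, 0.5, -0.25]:
    h = ligru_step(x, h, wz=0.1, uz=0.2, wh=0.5, uh=0.3)
```

Dropping the reset gate roughly halves the per-step gate computation, which is part of how the Li-GRU rows in the table below reach competitive error rates with a lighter model.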

Results

Task | Dataset | Metric | Value | Model
Speech Recognition | TIMIT | Percentage error | 14.2 | Li-GRU + Dropout + BatchNorm + Monophone Reg
Speech Recognition | TIMIT | Percentage error | 14.5 | LSTM + Dropout + BatchNorm + Monophone Reg
Speech Recognition | TIMIT | Percentage error | 14.9 | GRU + Dropout + BatchNorm + Monophone Reg
Speech Recognition | TIMIT | Percentage error | 15.9 | RNN + Dropout + BatchNorm + Monophone Reg
Speech Recognition | TIMIT | Percentage error | 16.0 | LSTM
Speech Recognition | TIMIT | Percentage error | 16.3 | Li-GRU
Speech Recognition | TIMIT | Percentage error | 16.5 | RNN
Speech Recognition | TIMIT | Percentage error | 16.6 | GRU
Speech Recognition | LibriSpeech test-clean | Word Error Rate (WER) | 6.2 | Li-GRU
Speech Recognition | DIRHA English WSJ | Word Error Rate (WER) | 23.9 | Li-GRU
Speech Recognition | CHiME real | Percentage error | 14.6 | Li-GRU

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)
A Hybrid Machine Learning Framework for Optimizing Crop Selection via Agronomic and Economic Forecasting (2025-07-06)
First Steps Towards Voice Anonymization for Code-Switching Speech (2025-07-02)
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement (2025-07-01)
AUTOMATIC PRONUNCIATION MISTAKE DETECTOR PROJECT REPORT (2025-06-25)