Keyu An, Hongyu Xiang, Zhijian Ou
In this paper, we present a new open-source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data efficiency of the hybrid approach and the simplicity of the end-to-end (E2E) approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks. Experiments show that CAT obtains state-of-the-art results, comparable to the fine-tuned hybrid models in Kaldi but with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency. Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to perform streaming ASR without accuracy degradation. We hope that CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community and can be further explored and improved.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Speech Recognition | WSJ dev93 | Word Error Rate (WER) | 5.7 | CTC-CRF VGG-BLSTM |
| Speech Recognition | WSJ eval92 | Word Error Rate (WER) | 3.2 | CTC-CRF VGG-BLSTM |
| Speech Recognition | Hub5'00 FISHER-SWBD | Word Error Rate (WER) | 12 | CTC-CRF |
| Speech Recognition | Hub5'00 SwitchBoard | Word Error Rate (WER), CallHome subset | 18.4 | CTC-CRF |
| Speech Recognition | Hub5'00 SwitchBoard | Word Error Rate (WER), full Hub5'00 set | 14.1 | CTC-CRF |
| Speech Recognition | Hub5'00 SwitchBoard | Word Error Rate (WER), SwitchBoard subset | 9.7 | CTC-CRF |
| Speech Recognition | AISHELL-1 | Word Error Rate (WER) | 6.34 | CTC-CRF 4gram-LM |
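
The table reports results as word error rate. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not CAT's scoring code, and benchmark scoring normally applies text normalization that is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent between two whitespace-tokenized transcripts.

    Hypothetical helper for illustration only; not part of the CAT toolkit.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors but do not add to the denominator, WER can exceed 100 when the hypothesis is much longer than the reference.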