Metric: Test WER (lower is better)
| # | Model | Test WER (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Whisper | 1.3 | Yes | Whisper-Flamingo: Integrating Visual Features in... | 2024-06-14 | Code |
| 2 | CTC/Attention | 1.5 | Yes | Auto-AVSR: Audio-Visual Speech Recognition with ... | 2023-03-25 | Code |
| 3 | MoCo + wav2vec (w/o extLM) | 2.7 | No | Leveraging Unimodal Self-Supervised Learning for... | 2022-02-24 | Code |
| 4 | End2end Conformer | 3.9 | No | End-to-end Audio-visual Speech Recognition with ... | 2021-02-12 | Code |
| 5 | Whisper-LLaMA | 6.6 | No | Whispering LLaMA: A Cross-Modal Generative Error... | 2023-10-10 | Code |
| 6 | LF-MMI TDNN | 6.7 | No | Audio-visual Recognition of Overlapped speech fo... | 2020-01-06 | - |
| 7 | CTC/attention | 8.2 | No | Audio-Visual Speech Recognition With A Hybrid CT... | 2018-09-28 | - |
| 8 | TM-seq2seq | 9.7 | No | Deep Audio-Visual Speech Recognition | 2018-09-06 | Code |
| 9 | TM-CTC | 10.1 | No | Deep Audio-Visual Speech Recognition | 2018-09-06 | Code |
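
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis (substitutions + deletions + insertions), divided by the number of reference words. The sketch below is a minimal illustration under that standard definition. It assumes whitespace tokenization and reports a percentage. The function name `word_error_rate` is hypothetical, and the text normalization and scoring tools used by the listed papers are not stated here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

A test-set WER is usually computed by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance scores.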