Metric: Word Error Rate (WER) (lower is better)
| # | Model | WER (%) | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Whisper | 0.68 | Yes | Whisper-Flamingo: Integrating Visual Features in... | 2024-06-14 | Code |
| 2 | Llama-AVSR | 0.81 | Yes | Large Language Models are Strong Audio-Visual Sp... | 2024-09-18 | Code |
| 3 | CTC/Attention | 1 | No | Auto-AVSR: Audio-Visual Speech Recognition with ... | 2023-03-25 | Code |
| 4 | AV-HuBERT Large | 1.3 | Yes | Learning Audio-Visual Speech Representation by M... | 2022-01-05 | Code |
| 5 | RAVEn Large | 1.4 | Yes | Jointly Learning Visual and Auditory Speech Repr... | 2022-12-12 | Code |
| 6 | CTC/Attention | 19.1 | Yes | Auto-AVSR: Audio-Visual Speech Recognition with ... | 2023-03-25 | Code |
| 7 | VTP with more data | 30.7 | Yes | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - |
| 8 | VTP | 40.6 | Yes | Sub-word Level Lip Reading With Visual Attention | 2021-10-14 | - |
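
For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the percentage scaling are assumptions made for illustration. The papers listed above may apply their own text normalization before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (hypothetical helper, not from any listed paper)."""
    # Assumption: plain whitespace tokenization, no case or punctuation normalization.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.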