Metric: exact match (EM); higher is better. Rows are sorted by EM in descending order. A minimal sketch of the EM computation follows the table.
| # | Model | EM | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Turing NLR v5 XXL 5.4B (fine-tuned) | 95.9 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 2 | ST-MoE-32B 269B (fine-tuned) | 95.1 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 3 | DeBERTa-1.5B | 94.1 | No | DeBERTa: Decoding-enhanced BERT with Disentangle... | 2020-06-05 | Code |
| 4 | PaLM 540B (fine-tuned) | 94.0 | No | PaLM: Scaling Language Modeling with Pathways | 2022-04-05 | Code |
| 5 | Vega v2 6B (fine-tuned) | 93.9 | No | Toward Efficient Language Model Pretraining and ... | 2022-12-04 | - |
| 6 | T5-XXL 11B (fine-tuned) | 93.4 | No | Exploring the Limits of Transfer Learning with a... | 2019-10-23 | Code |
| 7 | GESA 500M | 91.7 | No | Integrating a Heterogeneous Graph with Entity-aw... | 2023-07-19 | - |
| 8 | LUKE-Graph | 91.2 | No | LUKE-Graph: A Transformer-based Approach with Ga... | 2023-03-12 | - |
| 9 | LUKE (single model) | 90.64 | No | - | - | - |
| 10 | LUKE 483M | 90.6 | No | LUKE: Deep Contextualized Entity Representations... | 2020-10-02 | Code |
| 11 | KELM (finetuning RoBERTa-large based single model) | 89.1 | No | KELM: Knowledge Enhanced Pre-Trained Language Re... | 2021-09-09 | Code |
| 12 | ST-MoE-L 4.1B (fine-tuned) | 88.9 | No | ST-MoE: Designing Stable and Transferable Sparse... | 2022-02-17 | Code |
| 13 | FLAN 137B (prompt-tuned) | 85.1 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 14 | XLNet + MTL + Verifier (ensemble) | 83.09 | No | - | - | - |
| 15 | GPT-3 Large 760M (0-shot) | 82.1 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 16 | CSRLM (single model) | 81.78 | No | - | - | - |
| 17 | XLNet + Verifier | 81.5 | No | - | - | - |
| 18 | XLNet + MTL + Verifier (single model) | 81.46 | No | - | - | - |
| 19 | Switch Transformer 9B | 79.9 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 20 | SKG-NET (single model) | 79.48 | No | - | - | - |
| 21 | KELM (finetuning BERT-large based single model) | 76.2 | No | KELM: Knowledge Enhanced Pre-Trained Language Re... | 2021-09-09 | Code |
| 22 | sMLP – deterministic 9.4B (0-shot) | 73.4 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 23 | FLAN 137B (zero-shot) | 72.5 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 24 | Gshard 9B | 72.4 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 25 | SKG-BERT (single model) | 72.24 | No | - | - | - |
| 26 | KT-NET (single model) | 71.6 | No | - | - | - |
| 27 | DCReader+BERT (single model) | 69.49 | No | - | - | - |
| 28 | HASH Layers 10B (0-shot) | 67.2 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 29 | GraphBert (single) | 60.8 | No | - | - | - |
| 30 | Base Layers 10B (0-shot) | 60.7 | No | Efficient Language Modeling with Sparse all-MLP | 2022-03-14 | - |
| 31 | GraphBert-WordNet (single) | 59.86 | No | - | - | - |
| 32 | GraphBert-NELL (single) | 59.41 | No | - | - | - |
| 33 | BERT-Base (single model) | 54.04 | No | BERT: Pre-training of Deep Bidirectional Transfo... | 2018-10-11 | Code |
| 34 | DocQA + ELMo | 45.4 | No | ReCoRD: Bridging the Gap between Human and Machi... | 2018-10-30 | - |
| 35 | N-Grammer 343M | 28.9 | No | N-Grammer: Augmenting Transformers with latent n... | 2022-07-13 | Code |
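For reference, here is a minimal sketch of how an exact-match score like the ones above can be computed. The normalization steps (lowercasing, stripping punctuation and articles, collapsing whitespace) follow the common SQuAD-style convention; the official scorer for this benchmark may differ in details, and the helper names below are illustrative, not part of any released evaluation code.

```python
# Sketch of the exact-match (EM) metric, assuming SQuAD-style answer
# normalization. Treat this as illustrative, not the official scorer.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """A prediction scores a hit if it matches any reference exactly."""
    return any(normalize(prediction) == normalize(ref) for ref in references)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Corpus-level EM: percentage of predictions matching a reference."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Example: one hit out of two predictions -> EM = 50.0
print(em_score(["The Eiffel Tower", "Paris"], [["eiffel tower"], ["London"]]))
```

EM is reported on a 0-100 scale, so the values in the table are percentages of questions answered exactly right.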