Metric: Average (%) (higher is better)
| # | Model | Average (%) | Augmentations | Paper | Date |
|---|---|---|---|---|---|
| 1 | code-davinci-002 175B (CoT) | 73.9 | No | Evaluating Large Language Models Trained on Code | 2021-07-07 |
| 2 | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) | 66.5 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |
| 3 | PaLM 540B (CoT + self-consistency) | 62.2 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |
| 4 | Flan-PaLM 540B (3-shot, fine-tuned, CoT) | 61.3 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |
| 5 | PaLM 540B (CoT) | 57.6 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |
| 6 | Flan-PaLM 540B (3-shot, fine-tuned) | 48.2 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |
| 7 | PaLM 540B | 38.3 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 |