Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
Language model pretraining has led to significant performance gains, but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have a significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE, and SQuAD. These results highlight the importance of previously overlooked design choices and raise questions about the source of recently reported improvements. We release our models and code.
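One of the training choices RoBERTa revisits is how masked-language-model targets are generated: rather than fixing the masked positions once during preprocessing, masks can be resampled every time a sequence is seen ("dynamic masking"). The sketch below illustrates the standard BERT-style corruption rule (15% of tokens selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged) applied on the fly. The token ids, `MASK_ID`, vocabulary size, and function name are illustrative assumptions, not the released implementation.

```python
import random

# Illustrative constants (assumptions, not the released code):
MASK_ID = 0          # stand-in id for the mask token
VOCAB_SIZE = 50265   # roberta-base BPE vocabulary size

def dynamic_mask(tokens, mask_prob=0.15, rng=random):
    """Corrupt `tokens` for MLM training, resampled on every call.

    Returns (corrupted_tokens, labels); a label of -100 means
    "no prediction at this position" (the usual ignore index).
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: mask token
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)                        # 10%: unchanged
        else:
            corrupted.append(tok)
            labels.append(-100)
    return corrupted, labels
```

Because the corruption is drawn fresh per call, each training epoch sees a different mask pattern for the same sequence, whereas static masking would reuse one fixed pattern.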
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Stock Market Prediction | Astock | Accuracy | 62.49 | RoBERTa WWM Ext (News+Factors) |
| Stock Market Prediction | Astock | F1-score | 62.54 | RoBERTa WWM Ext (News+Factors) |
| Stock Market Prediction | Astock | Precision | 62.59 | RoBERTa WWM Ext (News+Factors) |
| Stock Market Prediction | Astock | Recall | 62.51 | RoBERTa WWM Ext (News+Factors) |
| Stock Market Prediction | Astock | Accuracy | 61.34 | RoBERTa WWM Ext (News) |
| Stock Market Prediction | Astock | F1-score | 61.48 | RoBERTa WWM Ext (News) |
| Stock Market Prediction | Astock | Precision | 61.97 | RoBERTa WWM Ext (News) |
| Stock Market Prediction | Astock | Recall | 61.32 | RoBERTa WWM Ext (News) |
| Reading Comprehension | RACE | Accuracy | 83.2 | RoBERTa |
| Reading Comprehension | RACE | Accuracy (High) | 81.3 | RoBERTa |
| Reading Comprehension | RACE | Accuracy (Middle) | 86.5 | RoBERTa |
| Question Answering | SIQA | Accuracy | 76.7 | RoBERTa-Large 355M (fine-tuned) |
| Question Answering | PIQA | Accuracy | 79.4 | RoBERTa-Large 355M |
| Question Answering | SQuAD2.0 dev | EM | 86.5 | RoBERTa (no data aug) |
| Question Answering | SQuAD2.0 dev | F1 | 89.4 | RoBERTa (no data aug) |
| Question Answering | SQuAD2.0 | EM | 86.82 | RoBERTa (single model) |
| Question Answering | SQuAD2.0 | F1 | 89.795 | RoBERTa (single model) |
| Common Sense Reasoning | SWAG | Test | 89.9 | RoBERTa |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 72.1 | RoBERTa-Large 355M |
| Natural Language Inference | WNLI | Accuracy | 89 | RoBERTa (ensemble) |
| Natural Language Inference | ANLI test | A1 | 72.4 | RoBERTa (Large) |
| Natural Language Inference | ANLI test | A2 | 49.8 | RoBERTa (Large) |
| Natural Language Inference | ANLI test | A3 | 44.4 | RoBERTa (Large) |
| Natural Language Inference | MultiNLI | Matched | 90.8 | RoBERTa |
| Natural Language Inference | MultiNLI | Mismatched | 90.2 | RoBERTa (ensemble) |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.922 | RoBERTa |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 96.7 | RoBERTa (ensemble) |
| Program Synthesis | ManyTypes4TypeScript | Average Accuracy | 59.84 | RoBERTa |
| Program Synthesis | ManyTypes4TypeScript | Average F1 | 57.54 | RoBERTa |
| Program Synthesis | ManyTypes4TypeScript | Average Precision | 57.45 | RoBERTa |
| Program Synthesis | ManyTypes4TypeScript | Average Recall | 57.62 | RoBERTa |
| Document Image Classification | RVL-CDIP | Accuracy | 90.06 | RoBERTa (base) |
| Text Classification | arXiv-10 | Accuracy | 0.779 | RoBERTa |
| Sentence Completion | HellaSwag | Accuracy | 85.5 | RoBERTa-Large Ensemble |
| Sentence Completion | HellaSwag | Accuracy | 81.7 | RoBERTa-Large 355M |