Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

Published: 2019-07-26

Tasks: Text Classification, Reading Comprehension, Question Answering, Multi-task Language Understanding, Sentence Completion, Only Connect Walls Dataset Task 1 (Grouping), Stock Market Prediction, Sentiment Analysis, Natural Language Inference, Common Sense Reasoning, Lexical Simplification, Type prediction, Semantic Textual Similarity, Linguistic Acceptability, Document Image Classification, Riddle Sense, Language Modelling
Links: Paper, PDF, Code (official), plus community code implementations

Abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
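The released checkpoints are most often used as a drop-in encoder that is fine-tuned on a downstream task. The snippet below is a minimal sketch of that usage via the HuggingFace transformers library, which is an assumption on our part (the paper's own code is released through fairseq); the `roberta-base` checkpoint name and the sentence-pair classification head are illustrative only.

```python
# Minimal sketch: feeding a sentence pair through a pretrained RoBERTa encoder
# with a classification head. Assumes the `transformers` and `torch` packages;
# this is NOT the authors' original fairseq setup.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

# Encode a premise/hypothesis pair the way GLUE-style tasks are usually fed in.
inputs = tokenizer(
    "A man inspects the uniform of a figure.",
    "The man is sleeping.",
    return_tensors="pt",
    truncation=True,
)

with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is freshly initialized here, so these probabilities
# are meaningless until the model is fine-tuned on labeled task data.
print(logits.softmax(dim=-1))
```

In practice one would fine-tune this head on a task such as MNLI or SST-2 before reading anything into the output probabilities.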

Results

Task | Dataset | Metric | Value | Model
Stock Market Prediction | Astock | Accuracy | 62.49 | RoBERTa WWM Ext (News+Factors)
Stock Market Prediction | Astock | F1-score | 62.54 | RoBERTa WWM Ext (News+Factors)
Stock Market Prediction | Astock | Precision | 62.59 | RoBERTa WWM Ext (News+Factors)
Stock Market Prediction | Astock | Recall | 62.51 | RoBERTa WWM Ext (News+Factors)
Stock Market Prediction | Astock | Accuracy | 61.34 | RoBERTa WWM Ext (News)
Stock Market Prediction | Astock | F1-score | 61.48 | RoBERTa WWM Ext (News)
Stock Market Prediction | Astock | Precision | 61.97 | RoBERTa WWM Ext (News)
Stock Market Prediction | Astock | Recall | 61.32 | RoBERTa WWM Ext (News)
Reading Comprehension | RACE | Accuracy | 83.2 | RoBERTa
Reading Comprehension | RACE | Accuracy (High) | 81.3 | RoBERTa
Reading Comprehension | RACE | Accuracy (Middle) | 86.5 | RoBERTa
Question Answering | SIQA | Accuracy | 76.7 | RoBERTa-Large 355M (fine-tuned)
Question Answering | PIQA | Accuracy | 79.4 | RoBERTa-Large 355M
Question Answering | SQuAD 2.0 dev | EM | 86.5 | RoBERTa (no data aug)
Question Answering | SQuAD 2.0 dev | F1 | 89.4 | RoBERTa (no data aug)
Question Answering | SQuAD 2.0 | EM | 86.82 | RoBERTa (single model)
Question Answering | SQuAD 2.0 | F1 | 89.795 | RoBERTa (single model)
Common Sense Reasoning | SWAG | Test | 89.9 | RoBERTa
Common Sense Reasoning | CommonsenseQA | Accuracy | 72.1 | RoBERTa-Large 355M
Natural Language Inference | WNLI | Accuracy | 89 | RoBERTa (ensemble)
Natural Language Inference | ANLI test | A1 | 72.4 | RoBERTa (Large)
Natural Language Inference | ANLI test | A2 | 49.8 | RoBERTa (Large)
Natural Language Inference | ANLI test | A3 | 44.4 | RoBERTa (Large)
Natural Language Inference | MultiNLI | Matched | 90.8 | RoBERTa
Natural Language Inference | MultiNLI | Mismatched | 90.2 | RoBERTa (ensemble)
Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.922 | RoBERTa
Sentiment Analysis | SST-2 Binary classification | Accuracy | 96.7 | RoBERTa (ensemble)
Program Synthesis | ManyTypes4TypeScript | Average Accuracy | 59.84 | RoBERTa
Program Synthesis | ManyTypes4TypeScript | Average F1 | 57.54 | RoBERTa
Program Synthesis | ManyTypes4TypeScript | Average Precision | 57.45 | RoBERTa
Program Synthesis | ManyTypes4TypeScript | Average Recall | 57.62 | RoBERTa
Document Image Classification | RVL-CDIP | Accuracy | 90.06 | RoBERTa base
Text Classification | arXiv-10 | Accuracy | 0.779 | RoBERTa
Image Classification | RVL-CDIP | Accuracy | 90.06 | RoBERTa base
Type prediction | ManyTypes4TypeScript | Average Accuracy | 59.84 | RoBERTa
Type prediction | ManyTypes4TypeScript | Average F1 | 57.54 | RoBERTa
Type prediction | ManyTypes4TypeScript | Average Precision | 57.45 | RoBERTa
Type prediction | ManyTypes4TypeScript | Average Recall | 57.62 | RoBERTa
Stock Trend Prediction | Astock | Accuracy | 62.49 | RoBERTa WWM Ext (News+Factors)
Stock Trend Prediction | Astock | F1-score | 62.54 | RoBERTa WWM Ext (News+Factors)
Stock Trend Prediction | Astock | Precision | 62.59 | RoBERTa WWM Ext (News+Factors)
Stock Trend Prediction | Astock | Recall | 62.51 | RoBERTa WWM Ext (News+Factors)
Stock Trend Prediction | Astock | Accuracy | 61.34 | RoBERTa WWM Ext (News)
Stock Trend Prediction | Astock | F1-score | 61.48 | RoBERTa WWM Ext (News)
Stock Trend Prediction | Astock | Precision | 61.97 | RoBERTa WWM Ext (News)
Stock Trend Prediction | Astock | Recall | 61.32 | RoBERTa WWM Ext (News)
Classification | arXiv-10 | Accuracy | 0.779 | RoBERTa
Sentence Completion | HellaSwag | Accuracy | 85.5 | RoBERTa-Large Ensemble
Sentence Completion | HellaSwag | Accuracy | 81.7 | RoBERTa-Large 355M
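
The SQuAD 2.0 entries above are reported with the standard extractive-QA metrics: exact match after answer normalization, and token-level F1. The sketch below is a minimal, self-contained illustration of how those two metrics are typically computed; it is not the official SQuAD evaluation script, and the example strings are hypothetical.

```python
# Minimal sketch of SQuAD-style EM and token-level F1 between a predicted
# answer string and a gold answer string: lowercase, strip punctuation and
# English articles, then compare. Illustrative only.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)      # both empty counts as a match
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))      # 1.0
print(round(f1_score("in the city of Paris", "Paris"), 2))  # 0.4
```

Benchmark scores multiply the per-question averages of these two quantities by 100, which is why the table reports values such as 86.82 (EM) and 89.795 (F1).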

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)