Peter Izsak, Moshe Berchansky, Omer Levy
While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
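The recipe itself is standard masked-language-model pretraining pushed hard on a single machine: mixed precision, a large effective batch size via gradient accumulation, an aggressive peak learning rate, and a fixed time/step budget. Below is a minimal, illustrative sketch of such a setup using the Hugging Face `transformers` Trainer; it is not the authors' exact configuration, and the corpus file, step budget, and hyperparameter values are assumptions for illustration only.

```python
# Minimal sketch of single-server MLM pretraining under a fixed budget.
# NOT the authors' exact recipe; corpus path and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Illustrative: any large raw-text corpus, tokenized to short sequences (length 128).
raw = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
train_set = raw.map(tokenize, batched=True, remove_columns=["text"])

# Train from scratch (random initialization), not from an existing checkpoint.
model = BertForMaskedLM(BertConfig())

args = TrainingArguments(
    output_dir="mlm-24h",
    max_steps=20_000,                  # fixed step budget (illustrative value)
    per_device_train_batch_size=32,
    gradient_accumulation_steps=64,    # large effective batch on one server
    learning_rate=2e-3,                # aggressive peak LR, as in large-batch recipes
    warmup_ratio=0.06,
    fp16=True,                         # mixed precision for throughput
    save_steps=5_000,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```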
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Paraphrase Identification | Quora Question Pairs | Accuracy | 70.7 | 24hBERT |
| Natural Language Inference | QNLI | Accuracy | 90.6 | 24hBERT |
| Natural Language Inference | MultiNLI | Accuracy (matched) | 84.4 | 24hBERT |
| Natural Language Inference | MultiNLI | Accuracy (mismatched) | 83.8 | 24hBERT |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.82 | 24hBERT |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 93.0 | 24hBERT |
| Linguistic Acceptability | CoLA | Matthews Correlation | 57.1 | 24hBERT |
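Scores like those above are typically obtained by fine-tuning the pretrained checkpoint on each GLUE task and evaluating on the corresponding held-out split. The sketch below shows this for one task (SST-2); the checkpoint path `mlm-24h` and the fine-tuning hyperparameters are assumptions, not values reported in the paper.

```python
# Illustrative GLUE fine-tuning/evaluation sketch for SST-2.
# The checkpoint directory and hyperparameters are assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Load the pretrained MLM checkpoint with a freshly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained("mlm-24h", num_labels=2)

sst2 = load_dataset("glue", "sst2")
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)
sst2 = sst2.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="sst2-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=sst2["train"],
    eval_dataset=sst2["validation"],
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # reports validation accuracy for the fine-tuned model
```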