# BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
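The deep bidirectional pre-training described above rests on a masked-language-model objective: some input tokens are hidden and the model predicts them from context on both sides. A minimal sketch of the masking scheme follows; the 80%/10%/10% replacement split matches the paper, while the toy vocabulary and function name are purely illustrative:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "dog", "runs", "fast"]  # illustrative only

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: ~15% of positions are selected as prediction
    targets. Of those, 80% are replaced with [MASK], 10% with a random
    token, and 10% are left unchanged. The model learns to recover the
    original token at each selected position using both-side context."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # prediction target at this position
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(TOY_VOCAB)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

Keeping 10% of targets unchanged and corrupting 10% randomly prevents the model from treating [MASK] as the only signal that a position needs predicting, since [MASK] never appears at fine-tuning time.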
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Stock Market Prediction | Astock | Accuracy | 59.11 | BERT Chinese |
| Stock Market Prediction | Astock | F1-score | 58.99 | BERT Chinese |
| Stock Market Prediction | Astock | Precision | 59.07 | BERT Chinese |
| Stock Market Prediction | Astock | Recall | 59.2 | BERT Chinese |
| Intent Recognition | PhotoChat | F1 | 53.2 | BERT |
| Intent Recognition | PhotoChat | Precision | 56.1 | BERT |
| Intent Recognition | PhotoChat | Recall | 50.6 | BERT |
| Question Answering | SQuAD1.1 dev | EM | 86.2 | BERT-LARGE (Ensemble+TriviaQA) |
| Question Answering | SQuAD1.1 dev | F1 | 92.2 | BERT-LARGE (Ensemble+TriviaQA) |
| Question Answering | SQuAD1.1 dev | EM | 84.2 | BERT-LARGE (Single+TriviaQA) |
| Question Answering | SQuAD1.1 dev | F1 | 91.1 | BERT-LARGE (Single+TriviaQA) |
| Question Answering | MRQA | Average F1 | 78.5 | BERT (large) |
| Question Answering | MultiTQ | Hits@1 | 8.3 | BERT |
| Question Answering | MultiTQ | Hits@10 | 48.2 | BERT |
| Question Answering | CoQA | In-domain | 82.5 | BERT Large Augmented (single model) |
| Question Answering | CoQA | Out-of-domain | 77.6 | BERT Large Augmented (single model) |
| Question Answering | CoQA | Overall | 81.1 | BERT Large Augmented (single model) |
| Question Answering | CoQA | In-domain | 79.8 | BERT-base finetune (single model) |
| Question Answering | CoQA | Out-of-domain | 74.1 | BERT-base finetune (single model) |
| Question Answering | CoQA | Overall | 78.1 | BERT-base finetune (single model) |
| Question Answering | MultiRC | EM | 24.1 | BERT-large (single model) |
| Question Answering | MultiRC | F1 | 70.0 | BERT-large (single model) |
| Question Answering | PIQA | Accuracy | 66.7 | BERT-Large 340M |
| Question Answering | SQuAD1.1 | EM | 87.433 | BERT (ensemble) |
| Question Answering | SQuAD1.1 | F1 | 93.16 | BERT (ensemble) |
| Question Answering | SQuAD1.1 | EM | 87.4 | BERT-LARGE (Ensemble+TriviaQA) |
| Question Answering | SQuAD1.1 | F1 | 93.2 | BERT-LARGE (Ensemble+TriviaQA) |
| Question Answering | SQuAD1.1 | EM | 85.083 | BERT (single model) |
| Question Answering | SQuAD1.1 | F1 | 91.835 | BERT (single model) |
| Question Answering | SQuAD1.1 | F1 | 91.8 | BERT-LARGE (Single+TriviaQA) |
| Common Sense Reasoning | SWAG | Dev | 86.6 | BERT-LARGE |
| Common Sense Reasoning | SWAG | Test | 86.3 | BERT-LARGE |
| Common Sense Reasoning | ReCoRD | EM | 54.04 | BERT-Base (single model) |
| Common Sense Reasoning | ReCoRD | F1 | 56.065 | BERT-Base (single model) |
| Natural Language Inference | WNLI | Accuracy | 65.1 | BERT-large 340M |
| Natural Language Inference | MultiNLI | Matched | 86.7 | BERT-LARGE |
| Natural Language Inference | MultiNLI | Mismatched | 85.9 | BERT-LARGE |
| Emotion Recognition | CPED | Accuracy of Sentiment | 48.96 | BERT_{utt} |
| Emotion Recognition | CPED | Macro-F1 of Sentiment | 45.18 | BERT_{utt} |
| Semantic Textual Similarity | MRPC | F1 | 89.3 | BERT-LARGE |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.865 | BERT-LARGE |
| Semantic Textual Similarity | Quora Question Pairs | F1 | 72.1 | BERT-LARGE |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 94.9 | BERT-LARGE |
| Type Prediction | ManyTypes4TypeScript | Average Accuracy | 57.52 | BERT |
| Type Prediction | ManyTypes4TypeScript | Average F1 | 54.1 | BERT |
| Type Prediction | ManyTypes4TypeScript | Average Precision | 54.18 | BERT |
| Type Prediction | ManyTypes4TypeScript | Average Recall | 54.02 | BERT |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 62 | BERT-large 340M |
| Text Classification | DBpedia | Error | 0.64 | BERT |
| Natural Language Understanding | GLUE | Average | 82.1 | BERT-LARGE |
| Natural Language Understanding | PDP60 | Accuracy | 78.3 | BERT-large 340M |
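Several rows above report SQuAD-style EM (exact match) and F1. Both compare a predicted answer string to a gold answer after light normalization: lowercasing, stripping punctuation and articles, and collapsing whitespace. A minimal sketch of the two metrics follows; it mirrors the normalization steps of the official SQuAD evaluation, but this simplified version is illustrative, not the official scorer:

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style normalization: lowercase, drop punctuation,
    drop articles (a/an/the), collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(p_toks) & Counter(g_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

On a full dataset, each metric is averaged over examples (taking the maximum over the gold answers for each question) and reported as a percentage, which is the form used in the table.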