Iz Beltagy, Kyle Lo, Arman Cohan
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018), to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification, and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
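Several rows in the table below compare SciBERT's in-domain WordPiece vocabulary (SciVocab) against BERT's original vocabulary (Base Vocab). A toy greedy longest-match-first WordPiece segmenter illustrates why an in-domain vocabulary tends to split scientific terms into fewer pieces; the two vocabularies here are made up for illustration and are not the real SciVocab or BERT vocab:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first segmentation, as in BERT's WordPiece.

    Non-initial subwords carry the '##' continuation prefix. If no piece
    matches at some position, the whole word maps to [UNK].
    """
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical general-domain vocabulary: lacks whole scientific terms.
base_vocab = {"ch", "##em", "##o", "##ther", "##apy", "cell"}
# Hypothetical in-domain vocabulary: contains the full term.
sci_vocab = {"chemotherapy", "cell"}

print(wordpiece("chemotherapy", base_vocab))  # ['ch', '##em', '##o', '##ther', '##apy']
print(wordpiece("chemotherapy", sci_vocab))   # ['chemotherapy']
```

Fewer pieces per term means shorter input sequences and subword units whose embeddings are learned from in-domain contexts, which is one plausible source of the SciVocab gains in the table.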
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Relation Extraction | ChemProt | F1 | 83.64 | SciBERT (Finetune) |
| Relation Extraction | ChemProt | F1 | 73.7 | SciBERT (Base Vocab) |
| Relation Extraction | SciERC | F1 | 74.64 | SciBERT (SciVocab) |
| Relation Extraction | SciERC | F1 | 74.42 | SciBERT (Base Vocab) |
| Named Entity Recognition (NER) | JNLPBA | F1 | 76.09 | SciBERT (SciVocab) |
| Dependency Parsing | GENIA | UAS | 92.46 | SciBERT (SciVocab) |
| Dependency Parsing | GENIA | UAS | 92.32 | SciBERT (Base Vocab) |
| Dependency Parsing | GENIA | LAS | 91.41 | SciBERT (SciVocab) |
| Dependency Parsing | GENIA | LAS | 91.26 | SciBERT (Base Vocab) |
| Named Entity Recognition (NER) | NCBI-disease | F1 | 86.88 | SciBERT (Base Vocab) |
| Named Entity Recognition (NER) | NCBI-disease | F1 | 86.45 | SciBERT (SciVocab) |
| Named Entity Recognition (NER) | SciERC | F1 | 67.57 | SciBERT (SciVocab) |
| Named Entity Recognition (NER) | SciERC | F1 | 65.24 | SciBERT (Base Vocab) |
| Named Entity Recognition (NER) | BC5CDR | F1 | 88.94 | SciBERT (SciVocab) |
| Named Entity Recognition (NER) | BC5CDR | F1 | 88.11 | SciBERT (Base Vocab) |
| Named Entity Recognition (NER) | JNLPBA | F1 | 75.77 | SciBERT (Base Vocab) |
| Text Classification | ACL-ARC | F1 | 70.98 | SciBERT |
| Text Classification | Paper Field | F1 | 65.71 | SciBERT (SciVocab) |
| Text Classification | Paper Field | F1 | 64.02 | SciBERT (Base Vocab) |
| Text Classification | SciCite | F1 | 84.99 | SciBERT (SciVocab) |
| Text Classification | SciCite | F1 | 84.43 | SciBERT (Base Vocab) |
| Text Classification | PubMed 20k RCT | F1 | 86.81 | SciBERT (Base Vocab) |
| Text Classification | SciCite | F1 | 84.9 | SciBERT |
| Text Classification | SciCite | Macro-F1 | 86.32 | SciBERT |
| PICO Extraction | EBM-NLP | F1 | 71.18 | SciBERT (SciVocab) |
| PICO Extraction | EBM-NLP | F1 | 70.82 | SciBERT (Base Vocab) |