Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cecile Paris
Abstract
Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
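The abstract mentions using similarity measures to nominate in-domain pretraining data. As a rough illustration of how such a measure can rank candidate corpora against a target task corpus, the sketch below scores candidates by Jensen-Shannon divergence between unigram word distributions. The choice of measure, the preprocessing, and the toy corpora are illustrative assumptions, not the paper's exact method.

```python
"""Minimal sketch: rank candidate pretraining corpora by similarity to a
target task corpus. Jensen-Shannon divergence over smoothed unigram
distributions is one simple choice of measure; it is an assumption here,
not necessarily the measure used in the paper."""
import math
from collections import Counter


def unigram_dist(texts, vocab):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    # Tiny epsilon smoothing keeps every probability positive,
    # so the KL terms below stay finite.
    eps = 1e-9
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}


def kl(p, q):
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def js_divergence(p, q):
    """Symmetric, bounded divergence between two distributions."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy corpora standing in for real data (illustrative assumptions only).
target = ["pt c/o chest pain", "pt denies sob"]  # target task text
candidates = {
    "tweets": ["omg chest pain again lol", "cant sleep again"],
    "forums": ["anyone else get chest pain after these meds?"],
}

vocab = sorted({tok for texts in [target, *candidates.values()]
                for text in texts for tok in text.lower().split()})
p_target = unigram_dist(target, vocab)

# Lower divergence = candidate corpus looks more like the target domain.
scores = {name: js_divergence(p_target, unigram_dist(texts, vocab))
          for name, texts in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: JSD = {score:.4f}")
```

In this setup, the candidate corpus with the lowest divergence from the target distribution would be nominated as pretraining data; any such score-based selection would of course need validation against downstream task performance, as the paper's experiments do.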
Results
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Clinical Concept Extraction | 2010 i2b2/VA | Exact Span F1 | 87.4 | ClinicalBERT |
Related Papers
- Selective Attention Federated Learning: Improving Privacy and Efficiency for Clinical Text Classification (2025-04-16)
- BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports (2024-08-21)
- Clinical Concept and Relation Extraction Using Prompt-based Machine Reading Comprehension (2023-03-14)
- Accurate clinical and biomedical Named entity recognition at scale (2022-07-19)
- GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records (2022-02-02)
- CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain (2021-12-16)
- Improving Clinical Document Understanding on COVID-19 Research with Spark NLP (2020-12-07)
- NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction (2020-10-23)