Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media
Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cecile Paris
Abstract
Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
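The abstract mentions using similarity measures to nominate in-domain pretraining data. As a rough illustration of how such a measure can rank candidate corpora against a target task corpus, the sketch below scores candidates by Jensen-Shannon divergence between unigram word distributions. The choice of measure, the preprocessing, and the toy corpora are illustrative assumptions, not the paper's exact method.

```python
"""Minimal sketch: rank candidate pretraining corpora by similarity to a
target task corpus. Jensen-Shannon divergence over smoothed unigram
distributions is one simple choice of measure; it is an assumption here,
not necessarily the measure used in the paper."""
import math
from collections import Counter


def unigram_dist(texts, vocab):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    # Tiny epsilon smoothing keeps every probability positive,
    # so the KL terms below stay finite.
    eps = 1e-9
    return {w: (counts[w] + eps) / (total + eps * len(vocab)) for w in vocab}


def kl(p, q):
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def js_divergence(p, q):
    """Symmetric, bounded divergence between two distributions."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy corpora standing in for real data (illustrative assumptions only).
target = ["pt c/o chest pain", "pt denies sob"]  # target task text
candidates = {
    "tweets": ["omg chest pain again lol", "cant sleep again"],
    "forums": ["anyone else get chest pain after these meds?"],
}

vocab = sorted({tok for texts in [target, *candidates.values()]
                for text in texts for tok in text.lower().split()})
p_target = unigram_dist(target, vocab)

# Lower divergence = candidate corpus looks more like the target domain.
scores = {name: js_divergence(p_target, unigram_dist(texts, vocab))
          for name, texts in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: JSD = {score:.4f}")
```

In this setup, the candidate corpus with the lowest divergence from the target distribution would be nominated as pretraining data; any such score-based selection would of course need validation against downstream task performance, as the paper's experiments do.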
Results
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Clinical Concept Extraction | 2010 i2b2/VA | Exact Span F1 | 87.4 | ClinicalBERT |
Related Papers
- Selective Attention Federated Learning: Improving Privacy and Efficiency for Clinical Text Classification (2025-04-16)
- BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports (2024-08-21)
- Clinical Concept and Relation Extraction Using Prompt-based Machine Reading Comprehension (2023-03-14)
- Accurate clinical and biomedical Named entity recognition at scale (2022-07-19)
- GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records (2022-02-02)
- CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain (2021-12-16)
- Improving Clinical Document Understanding on COVID-19 Research with Spark NLP (2020-12-07)
- NLNDE at CANTEMIST: Neural Sequence Labeling and Parsing Approaches for Clinical Concept Extraction (2020-10-23)