Zhi Wen, Xing Han Lu, Siva Reddy
One of the biggest challenges prohibiting the use of many current NLP methods in clinical settings is the limited availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and faster convergence when fine-tuning on downstream medical tasks.
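To make the pre-training task concrete: MeDAL frames abbreviation disambiguation as classification — given a context containing a medical abbreviation and a set of candidate expansions, predict the correct expansion. The sketch below is purely illustrative and is not the paper's model; it scores candidates by bag-of-words overlap with hypothetical profile tokens, whereas MeDAL pre-trains neural encoders (LSTM, LSTM+SA, ELECTRA) on this objective.

```python
# Illustrative sketch of the abbreviation-disambiguation task format.
# NOT the paper's method: a toy bag-of-words scorer stands in for the
# neural encoders actually pre-trained on MeDAL.

def disambiguate(context_tokens, candidates):
    """Pick the candidate expansion whose profile best overlaps the context.

    context_tokens: list of tokens surrounding the abbreviation.
    candidates: dict mapping each expansion to a list of profile tokens
                (hypothetical example data, not from the dataset).
    """
    ctx = {w.lower() for w in context_tokens}
    best, best_score = None, -1
    for expansion, profile in candidates.items():
        score = len(ctx & {w.lower() for w in profile})
        if score > best_score:
            best, best_score = expansion, score
    return best

# Toy example: "RA" can expand to "rheumatoid arthritis" or "right atrium".
candidates = {
    "rheumatoid arthritis": ["joint", "pain", "autoimmune", "methotrexate"],
    "right atrium": ["heart", "chamber", "atrial", "echocardiogram"],
}
context = "patient reports joint pain treated with methotrexate".split()
print(disambiguate(context, candidates))  # rheumatoid arthritis
```

A real model replaces the overlap score with a learned representation of the context, which is exactly the signal the pre-training transfers to downstream tasks such as mortality prediction.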
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Mortality Prediction | MIMIC-III | Accuracy | 0.8443 | ELECTRA (pretrained) |
| Mortality Prediction | MIMIC-III | Accuracy | 0.8325 | ELECTRA (from scratch) |
| Mortality Prediction | MIMIC-III | Accuracy | 0.8298 | LSTM+SA (pretrained) |
| Mortality Prediction | MIMIC-III | Accuracy | 0.8280 | LSTM (pretrained) |
| Mortality Prediction | MIMIC-III | Accuracy | 0.7996 | LSTM+SA (from scratch) |