Integrating curation into scientific publishing to train AI models

Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Cassie S. Mitchell, Thomas Lemberger

2023-10-31 | Entity Linking | Named Entity Recognition (NER)

Abstract

High-throughput extraction and structured labeling of data from academic articles are critical to enable downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotation included eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experiment designs and methodologies. The resultant dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task assessing whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of our dataset in performing a multimodal task for segmenting figures into panel images and their corresponding captions.
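To make the annotation scheme concrete, the following is a minimal sketch of how one annotated caption could be represented in Python. The eight entity classes and the two roles come from the abstract; everything else (the `Annotation` type, the `annotate` helper, the field names, and the example caption) is hypothetical and does not reflect the actual SourceData-NLP file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EntityClass(Enum):
    # The eight bioentity classes listed in the abstract.
    SMALL_MOLECULE = "small molecule"
    GENE_PRODUCT = "gene product"
    SUBCELLULAR_COMPONENT = "subcellular component"
    CELL_LINE = "cell line"
    CELL_TYPE = "cell type"
    TISSUE = "tissue"
    ORGANISM = "organism"
    DISEASE = "disease"


class Role(Enum):
    # The two experimental roles named in the abstract.
    CONTROLLED_INTERVENTION = "controlled intervention target"
    MEASUREMENT_OBJECT = "measurement object"


@dataclass(frozen=True)
class Annotation:
    # Hypothetical record: character span, entity class, optional role.
    start: int
    end: int
    entity_class: EntityClass
    role: Optional[Role] = None


def annotate(text, mention, entity_class, role=None):
    """Build an Annotation for the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), entity_class, role)


# Made-up caption, for illustration only (not taken from the dataset).
caption = "Western blot of p53 in HeLa cells treated with nutlin-3."
annotations = [
    annotate(caption, "p53", EntityClass.GENE_PRODUCT, Role.MEASUREMENT_OBJECT),
    annotate(caption, "HeLa", EntityClass.CELL_LINE),
    annotate(caption, "nutlin-3", EntityClass.SMALL_MOLECULE,
             Role.CONTROLLED_INTERVENTION),
]

for a in annotations:
    role = a.role.value if a.role else "no role assigned"
    print(f"{caption[a.start:a.end]!r}: {a.entity_class.value} ({role})")
```

The role is kept separate from the entity class because the abstract describes it as context-dependent: the same gene product could be a measurement object in one panel and an intervention target in another.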

Results

Task	Dataset	Metric	Value	Model
Open Information Extraction	The EMBO SourceData-NLP dataset	F1 Micro	84.7	BioLinkBERT-large
Information Extraction	The EMBO SourceData-NLP dataset	F1 Micro	84.7	BioLinkBERT-large
Named Entity Recognition (NER)	The EMBO SourceData-NLP dataset	F1 Micro	84.7	BioLinkBERT-large
Event Extraction	The EMBO SourceData-NLP dataset	F1 Micro	84.7	BioLinkBERT-large
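
The table reports a micro-averaged F1 score. As a rough illustration of what that metric measures, the sketch below pools true positives, false positives, and false negatives over all examples and classes before computing precision and recall. It assumes span-level exact matching on `(start, end, label)` tuples; the source does not state whether the reported 84.7 was computed at span or token level, and the label strings and the `micro_f1` function here are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example sets of (start, end, label) spans.

    Counts are pooled across all examples and labels, so frequent
    classes weigh more than rare ones.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted with the exact boundaries and label
        fp += len(p - g)   # predicted spans absent from the gold annotation
        fn += len(g - p)   # gold spans the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: the second span of the first example has the wrong label.
gold = [
    {(0, 1, "GENE_PRODUCT"), (3, 4, "CELL_LINE")},
    {(2, 3, "SMALL_MOLECULE")},
]
pred = [
    {(0, 1, "GENE_PRODUCT"), (3, 4, "CELL_TYPE")},
    {(2, 3, "SMALL_MOLECULE")},
]

score = micro_f1(gold, pred)
print(f"micro F1 = {score:.3f}")
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the micro F1 is 2/3.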
