Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Integrating curation into scientific publishing to train AI models

Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Cassie S. Mitchell, Thomas Lemberger

Published: 2023-10-31
Tasks: Entity Linking, Named Entity Recognition (NER)

Abstract

High throughput extraction and structured labeling of data from academic articles is critical to enable downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotation included eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experiment designs and methodologies. The resultant dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task assessing whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of our dataset in performing a multi-modal task for segmenting figures into panel images and their corresponding captions.
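As a rough illustration of the annotation scheme described in the abstract, the sketch below models one figure panel with its tagged entities. The eight entity classes and the two roles are taken from the abstract; every class name, field name, and the example caption are hypothetical and are not the actual SourceData-NLP schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EntityType(Enum):
    # The eight bioentity classes listed in the abstract.
    SMALL_MOLECULE = "small molecule"
    GENE_PRODUCT = "gene product"
    SUBCELLULAR_COMPONENT = "subcellular component"
    CELL_LINE = "cell line"
    CELL_TYPE = "cell type"
    TISSUE = "tissue"
    ORGANISM = "organism"
    DISEASE = "disease"


class Role(Enum):
    # The two experimental roles named in the abstract.
    INTERVENTION = "controlled intervention target"
    MEASUREMENT = "measurement object"


@dataclass
class EntityAnnotation:
    text: str                      # surface form as it appears in the caption
    start: int                     # character offset, inclusive
    end: int                       # character offset, exclusive
    entity_type: EntityType
    role: Optional[Role] = None    # not every entity has an experimental role


@dataclass
class Panel:
    caption: str
    entities: List[EntityAnnotation] = field(default_factory=list)


def annotate(caption: str, mention: str, entity_type: EntityType,
             role: Optional[Role] = None) -> EntityAnnotation:
    """Tag the first occurrence of `mention` in `caption` (hypothetical helper)."""
    start = caption.index(mention)  # raises ValueError if the mention is absent
    return EntityAnnotation(mention, start, start + len(mention), entity_type, role)


# Invented caption, for illustration only (not taken from the dataset).
caption = "(A) Western blot of p53 levels in HeLa cells treated with nutlin-3."
panel = Panel(caption, [
    annotate(caption, "p53", EntityType.GENE_PRODUCT, Role.MEASUREMENT),
    annotate(caption, "HeLa", EntityType.CELL_LINE),
    annotate(caption, "nutlin-3", EntityType.SMALL_MOLECULE, Role.INTERVENTION),
])
```

Storing character offsets alongside the surface text keeps each annotation tied to its panel caption, which is one plausible way to support both the NER task and the role-classification task from the same records.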

Results

Task                            Dataset                          Metric    Value  Model
Open Information Extraction     The EMBO SourceData-NLP dataset  F1 Micro  84.7   BioLinkBERT-large
Information Extraction          The EMBO SourceData-NLP dataset  F1 Micro  84.7   BioLinkBERT-large
Named Entity Recognition (NER)  The EMBO SourceData-NLP dataset  F1 Micro  84.7   BioLinkBERT-large
Event Extraction                The EMBO SourceData-NLP dataset  F1 Micro  84.7   BioLinkBERT-large
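The metric reported above is micro-averaged F1. This page does not say whether the score was computed at token level or span level, so the following is only a minimal sketch that assumes exact-match scoring over (start, end, label) spans. The function name and the toy spans are illustrative and do not come from the paper's evaluation code.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool counts over all documents, then compute P, R, F1.

    Assumes exact span-and-label match; a span with the right offsets but
    the wrong label counts as one false positive and one false negative.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted spans that match gold exactly
        fp += len(p - g)   # predicted spans with no gold match
        fn += len(g - p)   # gold spans that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: two captions, one label error in the first.
gold = [{(0, 3, "GENE_PRODUCT"), (10, 14, "CELL_LINE")}, {(5, 9, "DISEASE")}]
pred = [{(0, 3, "GENE_PRODUCT"), (10, 14, "CELL_TYPE")}, {(5, 9, "DISEASE")}]
score = micro_f1(gold, pred)
```

Because the counts are pooled before averaging, frequent entity classes dominate the score, which is the usual difference between micro and macro F1.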

Related Papers

Flippi: End To End GenAI Assistant for E-Commerce (2025-07-08)
Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models (2025-06-28)
Better Semi-supervised Learning for Multi-domain ASR Through Incremental Retraining and Data Filtering (2025-06-05)
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering (2025-06-04)
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World (2025-06-01)
EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models (2025-05-29)
Label-Guided In-Context Learning for Named Entity Recognition (2025-05-29)
Distilling Closed-Source LLM's Knowledge for Locally Stable and Economic Biomedical Entity Linking (2025-05-26)