
The University of Massachusetts Amherst (UMass) citation field extraction dataset contains labeled, segmented citations extracted from articles on arXiv. Compared with previous standard datasets for citation field extraction, it has four times more data, provides detailed nested labels rather than coarse-grained flat labels, and draws from four academic disciplines rather than one: computer science, mathematics, physics, and quantitative biology.

The dataset consists of 6,000 citation strings. Of these, 1,829 had been labeled at the time of its last publication; a more recent paper, 'Using BibTeX to Automatically Generate Labeled Data for Citation Field Extraction' (Dung Thai, Zhiyang Xu, Nicholas Monath, Boris Veytsman, and Andrew McCallum), reports 2,476. Each citation string is labeled hierarchically, with separate coarse-grained and fine-grained labeled segments.
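
To make the idea of hierarchical labeling concrete, the following is a minimal sketch of how a nested annotation could be represented and read at either granularity. The `Segment` class, the label names (`authors`, `person`, `first`, `last`, `title`, `date`, `year`), and the helper functions are all hypothetical illustrations, not the dataset's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """A labeled span; label names here are hypothetical, not the dataset's schema."""
    label: str
    text: str = ""
    children: List["Segment"] = field(default_factory=list)

    def tokens(self) -> List[str]:
        # A parent segment's tokens are the concatenation of its children's tokens.
        if self.children:
            return [tok for child in self.children for tok in child.tokens()]
        return self.text.split()


def coarse_labels(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Pair each token with its top-level (coarse-grained) label only."""
    return [(tok, seg.label) for seg in segments for tok in seg.tokens()]


def fine_labels(segments: List[Segment], prefix: str = "") -> List[Tuple[str, str]]:
    """Pair each token with its full nested (fine-grained) label path."""
    out: List[Tuple[str, str]] = []
    for seg in segments:
        path = f"{prefix}/{seg.label}" if prefix else seg.label
        if seg.children:
            out.extend(fine_labels(seg.children, path))
        else:
            out.extend((tok, path) for tok in seg.text.split())
    return out


# An assumed annotation of part of one citation string.
citation = [
    Segment("authors", children=[
        Segment("person", children=[
            Segment("first", "Sam"),
            Segment("last", "Anzaroot"),
        ]),
    ]),
    Segment("title", "A new dataset for fine-grained citation field extraction"),
    Segment("date", children=[Segment("year", "2013")]),
]

for token, label in fine_labels(citation):
    print(f"{token}\t{label}")
```

Under these assumptions, `coarse_labels(citation)` tags "Sam" as `authors`, while `fine_labels(citation)` tags the same token as `authors/person/first`. A flat-label dataset would only support the first view.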

Dataset introduced in the following paper:

Sam Anzaroot and Andrew McCallum. A new dataset for fine-grained citation field extraction. In ICML Workshop on Peer Reviewing and Publishing Models (PEER), 2013.

UMass Citation Field Extraction