FormulaNet

Images, Texts | CC-BY 4.0 | Introduced 2022-08-29

FormulaNet is a new large-scale dataset for Mathematical Formula Detection. It consists of 46,672 pages of STEM documents from arXiv, annotated with 13 types of labels. The dataset is split into a train set of 44,338 pages and a validation set of 2,334 pages. For copyright reasons, we can only provide the list of papers; the papers themselves must be downloaded and processed by the user.
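As a quick sanity check on the split described above, the following sketch adds up the stated page counts and computes the validation share. The variable names are illustrative only and are not part of any official FormulaNet tooling.

```python
# Page counts as stated in the dataset description.
TRAIN_PAGES = 44_338
VAL_PAGES = 2_334

# The two splits should add up to the full dataset size.
total_pages = TRAIN_PAGES + VAL_PAGES
val_fraction = VAL_PAGES / total_pages

print(f"total pages: {total_pages}")
print(f"validation share: {val_fraction:.1%}")
```

The counts sum to the stated total of 46,672 pages, with roughly 5% of the pages held out for validation.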

Labels

  • inline formulae
  • display formulae
  • headers
  • tables
  • figures
  • paragraphs
  • captions
  • footnotes
  • lists
  • bibliographies
  • display formulae reference number
  • display formulae with reference number
  • footnote reference number
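When training a detector on these classes, it is common to map each label name to an integer id. The sketch below shows one way this could be done; the id assignment (list order, starting at 0) and the names `LABELS`, `LABEL_TO_ID`, and `ID_TO_LABEL` are assumptions for illustration and may not match the dataset's own annotation format.

```python
# Label names copied from the list above.
# The integer ids are hypothetical (list order), not the official encoding.
LABELS = [
    "inline formulae",
    "display formulae",
    "headers",
    "tables",
    "figures",
    "paragraphs",
    "captions",
    "footnotes",
    "lists",
    "bibliographies",
    "display formulae reference number",
    "display formulae with reference number",
    "footnote reference number",
]

# Forward and reverse lookups between label names and ids.
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABELS)}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

print(len(LABELS))
print(LABEL_TO_ID["display formulae"])
```

Keeping both lookups makes it easy to convert annotations to ids for training and to turn predicted ids back into readable label names.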