Scientific statement classification dataset from arXMLiv 08.2018

Texts | SIGMathLing members, research-only | Introduced 2019-08-29

This resource contains 10.5 million paragraphs with associated statement labels, stored as one paragraph per file with one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, selecting only the first paragraph immediately following each heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).

The annotated statement dataset is derived from arXMLiv, a machine-readable HTML5 representation of the arXiv corpus of scientific articles.
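
The on-disk layout described above (one subdirectory per class, one paragraph per file, one sentence per line) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name iter_paragraphs and the root directory passed to it are hypothetical, and the .txt extension is assumed from the source paths listed with the examples.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_paragraphs(root: Path) -> Iterator[Tuple[str, List[str]]]:
    """Yield (label, sentences) for every paragraph file under root.

    Assumes the layout root/<class name>/<hash>.txt, where the
    subdirectory name is the statement label and each non-empty
    line of the file is one sentence of the paragraph.
    """
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        label = class_dir.name
        for path in sorted(class_dir.glob("*.txt")):
            with path.open(encoding="utf-8") as handle:
                # Drop trailing newlines and skip blank lines.
                sentences = [line.strip() for line in handle if line.strip()]
            yield label, sentences
```

Because the function is a generator, the 10.5 million files are read lazily rather than loaded into memory at once; a caller can stop early or stream the pairs into a tokenizer.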

Examples

Definition with math lexemes (main data, single sentence, line breaks added for readability):

a directed quantum turing automaton is a quadruple
  italic_T RELOP_equals OPEN_( caligraphic_H PUNCT_, caligraphic_K PUNCT_, caligraphic_L PUNCT_, italic_tau CLOSE_) PUNCT_,
where
  caligraphic_H caligraphic_K and caligraphic_L
are finite dimensional hilbert spaces over the complex field blackboard_C and
  italic_tau METARELOP_colon caligraphic_H MULOP_tensor_product caligraphic_K ARROW_rightarrow
    caligraphic_H MULOP_tensor_product caligraphic_L
is an isometry in fdhilb

source: definition/1e4a1aea317bbf363c5314fb25eaf72c8a350a1007bb8aafc542e188405b93d5.txt

Same definition without math lexemes (nomath data, single sentence, line breaks added for readability):

a directed quantum turing automaton is a quadruple
  where and are finite dimensional hilbert spaces over the complex field and
  is an isometry in fdhilb

nomath source: definition/35b170bae4259a5c430846116142d4e4a45097e52daf818b78ea378d94d14a21.txt
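
The relationship between the two examples can be approximated in code. The sketch below rests on an assumption drawn only from the single example shown here: every math lexeme (e.g. italic_T, RELOP_equals, PUNCT_,) contains an underscore, while ordinary words do not. The nomath data is distributed as its own set of files, so this is an illustration of the difference, not the procedure used to produce it; the function name strip_math_lexemes is hypothetical.

```python
def strip_math_lexemes(sentence: str) -> str:
    """Remove tokens that look like math lexemes from one sentence.

    Assumption (inferred from a single example, not from the dataset
    documentation): math lexemes always contain an underscore and
    plain words never do.
    """
    kept = [token for token in sentence.split() if "_" not in token]
    return " ".join(kept)
```

Applied to the definition with math lexemes above, joined back into a single line, this returns the text of the nomath version.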