GUM

Georgetown University Multilayer corpus

Speech, Texts, CC-BY-NC-SA

GUM is an open-source, multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

  • Multiple POS tags, morphological features and lemmatization
  • Sentence segmentation and rough speech act labels
  • Document structure in TEI XML (paragraphs, headings, figures, etc.)
  • ISO date/time annotations
  • Speaker and addressee information (where relevant)
  • Constituent and dependency syntax
  • Information status (given, accessible, new, split antecedent)
  • Entity and coreference annotation, including bridging anaphora
  • Entity linking (Wikification)
  • Discourse parses in Rhetorical Structure Theory and discourse dependencies
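
To make the token-level layers above (POS tags, morphological features, lemmas, dependency syntax) more concrete, here is a minimal sketch of reading them from a CoNLL-U style file. It assumes the dependency layer is available in the standard ten-column CoNLL-U layout; the two-token sample sentence, the Token class, and the helper names parse_feats and parse_conllu are invented for illustration and are not taken from the corpus or its tooling.

```python
from dataclasses import dataclass

# Hypothetical two-token sample in CoNLL-U layout (10 tab-separated columns).
# Invented for illustration; not an excerpt from GUM.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tMood=Ind|Tense=Pres\t0\troot\t_\t_\n"
)


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # surface word form
    lemma: str    # lemmatization layer
    upos: str     # coarse POS tag
    xpos: str     # fine-grained POS tag
    feats: dict   # morphological features
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head


def parse_feats(field):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_conllu(text):
    """Collect the plain tokens, skipping comments and blank lines."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        tokens.append(
            Token(
                idx=int(cols[0]),
                form=cols[1],
                lemma=cols[2],
                upos=cols[3],
                xpos=cols[4],
                feats=parse_feats(cols[5]),
                head=int(cols[6]),
                deprel=cols[7],
            )
        )
    return tokens


tokens = parse_conllu(SAMPLE)
for tok in tokens:
    print(tok.idx, tok.form, tok.lemma, tok.upos, tok.xpos, tok.head, tok.deprel)
```

The other layers (entities, coreference, discourse parses, TEI document structure) are not covered by this sketch and would need their own readers.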