Cleaned_Lang8

Texts · Introduced 2024-01-30

Lang-8 Preprocessed Dataset (for GED):

  • Dataset: Lang-8, a publicly available dataset containing user-generated content, primarily from second-language learners, focused on writing errors.
  • Task: Grammatical Error Detection (GED).
  • Size: 200,000 sentences, each labeled '0' (an incorrect sentence) or '1' (a corrected version).
  • Preprocessing: Cleaned and transformed to improve model performance: noise removed, inconsistent annotations handled, and the data prepared for training.
  • Usage: Suited to training and evaluating models for GED and Grammatical Error Correction (GEC).
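
The labeling scheme above can be sketched in a few lines of Python. This is only an illustration: the (sentence, label) row shape, the example sentences, and the helper names label_counts and split_by_label are assumptions made here, not the dataset's actual file format or API.

```python
from collections import Counter

# Hypothetical rows in a (sentence, label) shape; the sentences are made up
# for illustration and are not taken from the dataset.
rows = [
    ("She go to school every day.", 0),
    ("She goes to school every day.", 1),
    ("I have went there before.", 0),
    ("I have gone there before.", 1),
]


def label_counts(rows):
    """Count sentences per label (0 = incorrect, 1 = corrected)."""
    return Counter(label for _, label in rows)


def split_by_label(rows):
    """Separate sentences into incorrect (label 0) and corrected (label 1) lists."""
    incorrect = [text for text, label in rows if label == 0]
    corrected = [text for text, label in rows if label == 1]
    return incorrect, corrected


counts = label_counts(rows)
incorrect, corrected = split_by_label(rows)
print(counts[0], counts[1])
```

Checking the label balance this way before training a binary GED classifier is a cheap sanity test; the real column names and storage format should be confirmed against the dataset files.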