OWT2

OpenWebText2

OpenWebText2 is an enhanced version of the original OpenWebTextCorpus. It encompasses all Reddit submissions from 2005 up until April 2020, with additional months becoming available after the corresponding PushShift dump files are released¹²³. Here are the key details:

  • Dataset Description:

    • OpenWebText2 is part of EleutherAI's The Pile dataset.
    • It covers web pages linked from Reddit submissions and is designed to be a high-quality internet dataset.
    • The dataset was created by scraping URLs extracted from Reddit submissions with a minimum score of 3 as a proxy for quality⁴.
    • The plug-and-play version of OpenWebText2 contains 17,103,059 documents and is approximately 65.86 GB when uncompressed².
  • Features:

    • Each document in OpenWebText2 has two main features:
      • Title: A string representing the title of the submission.
      • Text: A string containing the text scraped from the URL linked in the submission¹.
  • License and Version:

    • License: No known license.
    • Version: 1.0.0¹.
  • Size:

    • Download Size: Approximately 27.3 GiB.
    • Dataset Size: Approximately 63.8 GiB³.
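
The score threshold and the two-field document layout described above can be illustrated with a minimal sketch. It is not the EleutherAI pipeline: it assumes submissions arrive as JSON lines with "title", "url", and "score" fields, and the names select_urls and make_document, along with the sample records, are hypothetical.

```python
import json

MIN_SCORE = 3  # minimum Reddit score used as a quality proxy (from the description above)


def select_urls(lines, min_score=MIN_SCORE):
    """Yield (title, url) pairs for submissions that meet the score threshold.

    Assumes each line is a JSON object with "title", "url" and "score" keys;
    these field names are an assumption, not taken from the source.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        submission = json.loads(line)
        if submission.get("score", 0) >= min_score and submission.get("url"):
            yield submission.get("title", ""), submission["url"]


def make_document(title, text):
    """Build a record with the two features listed above: title and text."""
    return {"title": title, "text": text}


# Hypothetical sample input; real dump files are far larger.
sample = [
    '{"title": "A", "url": "https://example.com/a", "score": 5}',
    '{"title": "B", "url": "https://example.com/b", "score": 2}',
    '{"title": "C", "url": "https://example.com/c", "score": 3}',
]
selected = list(select_urls(sample))
```

In this sketch the submission with score 2 is dropped, and each surviving URL would then be scraped to fill the text field of a document.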

(1) the_pile_openwebtext2 | TensorFlow Datasets. https://www.tensorflow.org/datasets/community_catalog/huggingface/the_pile_openwebtext2.
(2) GitHub - EleutherAI/openwebtext2. https://github.com/EleutherAI/openwebtext2.
(3) the_pile_openwebtext2 · Datasets at Hugging Face. https://huggingface.co/datasets/the_pile_openwebtext2.
(4) Eleuther AI site | OpenWebText2. https://researcher2.eleuther.ai/projects/open-web-text2/.
(5) OpenWebText2 | EleutherAI. https://www.eleuther.ai/artifacts/openwebtext2.