OpenWebText


OpenWebText is an open-source recreation of the WebText corpus. It contains 38GB of text: web content extracted from URLs shared on Reddit with at least three upvotes.

Source: RoBERTa: A Robustly Optimized BERT Pretraining Approach
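
The selection rule in the description (keep URLs shared on Reddit with at least three upvotes) can be illustrated with a minimal sketch. This is not the actual OpenWebText pipeline: the `Submission` record, the `select_urls` helper, and the sample data are hypothetical, and dropping duplicate URLs is an assumption added for illustration.

```python
from dataclasses import dataclass

MIN_UPVOTES = 3  # threshold taken from the description above


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for one Reddit link submission."""
    url: str
    upvotes: int


def select_urls(submissions, min_upvotes=MIN_UPVOTES):
    """Return URLs with at least `min_upvotes`, in first-seen order.

    Duplicate URLs are dropped (an assumption, not stated in the source).
    """
    seen = set()
    selected = []
    for sub in submissions:
        if sub.upvotes >= min_upvotes and sub.url not in seen:
            seen.add(sub.url)
            selected.append(sub.url)
    return selected


# Made-up sample data for illustration only.
sample = [
    Submission("https://example.com/a", 5),
    Submission("https://example.com/b", 2),   # below threshold, skipped
    Submission("https://example.com/c", 3),   # exactly at threshold, kept
    Submission("https://example.com/a", 10),  # duplicate URL, skipped
]

print(select_urls(sample))
```

The text of each selected URL would then need to be downloaded and extracted, which this sketch does not cover.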