OpenWebText
OpenWebText is an open-source recreation of the WebText corpus. The text is web content extracted from URLs shared on Reddit with at least three upvotes (38GB in total).
Source: RoBERTa: A Robustly Optimized BERT Pretraining Approach
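The URL selection rule described above can be illustrated with a minimal sketch. It is not the actual OpenWebText pipeline: the record layout (dicts with hypothetical `url` and `score` keys), the `select_urls` helper, and the deduplication step are all assumptions made for illustration. Only the three-upvote threshold comes from the description.

```python
from typing import Dict, Iterable, List

MIN_UPVOTES = 3  # threshold stated in the dataset description


def select_urls(submissions: Iterable[Dict], min_upvotes: int = MIN_UPVOTES) -> List[str]:
    """Return unique URLs from submissions that meet the upvote threshold.

    Each submission is assumed to be a dict with hypothetical "url" and
    "score" keys. Deduplication is an added assumption, not something the
    description states. First-seen order is preserved.
    """
    seen = set()
    selected = []
    for post in submissions:
        url = post.get("url")
        # Skip posts without a URL or below the upvote threshold.
        if not url or post.get("score", 0) < min_upvotes:
            continue
        if url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


# Made-up sample records, for illustration only.
sample = [
    {"url": "https://example.com/a", "score": 5},
    {"url": "https://example.com/b", "score": 2},
    {"url": "https://example.com/a", "score": 7},
    {"url": "https://example.com/c", "score": 3},
]

kept = select_urls(sample)
```

Here `kept` holds the URLs whose posts reached the threshold. Fetching each page and extracting its text would be a separate step that this sketch does not cover.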