Liu et al. Corpus

Texts
Introduced 2019-05-08

The Liu et al. Corpus is a pretraining dataset for large language models. It consists of 160 GB of news, books, stories, and web text.