C4

Colossal Clean Crawled Corpus

TextsUnknownIntroduced 2019-10-23

C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.

The dataset can be downloaded in a pre-processed form from allennlp.