mC4

Texts · CC BY 4.0 · Introduced 2020-10-22

mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.

Source: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer