mC4
Texts | CC BY 4.0 | Introduced 2020-10-22
mC4 is a multilingual variant of the C4 dataset. It comprises natural text in 101 languages drawn from the public Common Crawl web scrape.
Source: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer