MMC4

Multimodal C4

Images · Texts · ODC-BY · Introduced 2023-04-14

Multimodal C4 (MMC4) augments the popular text-only C4 corpus with interleaved images. The corpus comprises 103M documents that together hold 585M images interleaved with 43B English tokens.

Source: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

Image Source: MMC4 - GitHub Repo
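
To give a feel for the scale quoted in the description, the short sketch below derives rough per-document and per-image averages from the three headline figures (103M documents, 585M images, 43B tokens). The function and constant names are hypothetical and chosen only for illustration. The inputs are the rounded numbers from the description, so the results are coarse averages and say nothing about how images and tokens are actually distributed across documents.

```python
# Rounded headline figures taken from the dataset description.
NUM_DOCUMENTS = 103_000_000
NUM_IMAGES = 585_000_000
NUM_TOKENS = 43_000_000_000


def corpus_averages(num_documents, num_images, num_tokens):
    """Return rough corpus-wide averages (hypothetical helper, not part of MMC4)."""
    return {
        "images_per_document": num_images / num_documents,
        "tokens_per_document": num_tokens / num_documents,
        "tokens_per_image": num_tokens / num_images,
    }


averages = corpus_averages(NUM_DOCUMENTS, NUM_IMAGES, NUM_TOKENS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these rounded inputs, a document carries on the order of five to six images and a few hundred tokens on average.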