MMC4
Multimodal C4
Images, Texts · ODC-BY · Introduced 2023-04-14
Multimodal C4 (MMC4) augments the popular text-only C4 corpus with interleaved images. The corpus comprises 103M documents containing 585M images, interleaved with 43B English tokens.
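To give a feel for the scale, the short sketch below derives rough per-document averages from the three headline figures quoted above. These averages are not reported by the source; they are simple ratios of rounded totals, and the constant and function names are illustrative only.

```python
# Headline figures as quoted in the description (rounded totals).
NUM_DOCUMENTS = 103_000_000      # 103M documents
NUM_IMAGES = 585_000_000         # 585M images
NUM_TOKENS = 43_000_000_000      # 43B English tokens


def per_document(total: int, documents: int = NUM_DOCUMENTS) -> float:
    """Return the mean count per document for a corpus-wide total."""
    return total / documents


# Derived averages; approximate, since the inputs are rounded.
images_per_doc = per_document(NUM_IMAGES)
tokens_per_doc = per_document(NUM_TOKENS)
tokens_per_image = NUM_TOKENS / NUM_IMAGES

print(f"images per document: {images_per_doc:.2f}")
print(f"tokens per document: {tokens_per_doc:.1f}")
print(f"tokens per image:    {tokens_per_image:.1f}")
```

Under these rounded totals, a document holds on the order of five to six images and a few hundred tokens on average. The real distribution across documents is likely far from uniform, so treat these as back-of-the-envelope figures.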
Source: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text
Image Source: MMC4 - GitHub Repo