WanJuan

Images | Texts | Videos | CC BY 4.0 License | Introduced 2023-08-21

WanJuan is a large-scale multimodal training corpus. It comprises text, image-text, and video data, with a total volume exceeding 2 TB.