WikiWeb2M
Wikipedia Webpage 2M
Images, Texts | Introduced 2023-05-09
Wikipedia Webpage 2M (WikiWeb2M) is an open-source multimodal dataset consisting of over 2 million English Wikipedia articles. It was created by re-scraping the roughly 2M English articles in WIT (the Wikipedia-based Image Text dataset). Each webpage sample includes the page URL and title; the section titles, text, and indices; and the images with their captions.
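To make the per-page layout described above concrete, here is a minimal Python sketch of how one sample could be modeled in memory. The class and field names (`PageSample`, `Section`, `Image`, `captions`) and the example URLs are illustrative assumptions, not the dataset's released schema; the released files define their own field names, which are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    # Hypothetical field names; the real dataset may name these differently.
    url: str
    caption: str


@dataclass
class Section:
    index: int   # position of the section within the page
    title: str
    text: str
    images: List[Image] = field(default_factory=list)


@dataclass
class PageSample:
    url: str
    title: str
    sections: List[Section] = field(default_factory=list)

    def captions(self) -> List[str]:
        """Collect all image captions on the page, in section order."""
        return [img.caption for sec in self.sections for img in sec.images]


# A made-up page used only to exercise the structure above.
page = PageSample(
    url="https://en.wikipedia.org/wiki/Example",
    title="Example",
    sections=[
        Section(
            index=0,
            title="Introduction",
            text="Lead text.",
            images=[Image(url="https://example.org/a.jpg",
                          caption="An example image")],
        ),
        Section(index=1, title="History", text="More text."),
    ],
)
```

Keeping images nested under their section mirrors the page-level framing of the dataset: text and images stay associated with the section they appear in, rather than being flattened into independent image-caption pairs.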
Source: WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset