WikiWeb2M

Wikipedia Webpage 2M

Modalities: Images, Texts

Introduced: 2023-05-09

Wikipedia Webpage 2M (WikiWeb2M) is an open-source multimodal dataset of over 2 million English Wikipedia articles. It was created by re-scraping the roughly 2 million English articles in WIT (the Wikipedia-based Image Text dataset). Each webpage sample includes the page URL and title; the section titles, text, and indices; and the images with their captions.
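To make the per-page structure concrete, here is a minimal Python sketch of how one sample might be modeled in memory. All class and field names (WebpageSample, Section, Image, page_url, and so on) are hypothetical and chosen for illustration only; they are not the field names used in the released dataset files, and the example URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Image:
    """One image on a page (hypothetical layout, not the released schema)."""
    url: str
    caption: Optional[str] = None  # an image may have no caption


@dataclass
class Section:
    """One section of a page, identified by its index within the page."""
    index: int
    title: str
    text: str
    images: List[Image] = field(default_factory=list)


@dataclass
class WebpageSample:
    """A whole page: URL, title, and its ordered sections."""
    page_url: str
    page_title: str
    sections: List[Section] = field(default_factory=list)

    def captioned_images(self) -> List[Tuple[int, Image]]:
        """Return (section index, image) pairs for images that have a caption."""
        return [
            (section.index, image)
            for section in self.sections
            for image in section.images
            if image.caption
        ]


# Illustrative sample with placeholder values.
sample = WebpageSample(
    page_url="https://en.wikipedia.org/wiki/Example",
    page_title="Example",
    sections=[
        Section(0, "Introduction", "Lead text.",
                [Image("https://example.org/a.jpg", "A caption")]),
        Section(1, "History", "More text.",
                [Image("https://example.org/b.jpg")]),
    ],
)
```

Keeping images nested under their section, rather than in one flat list per page, preserves the page-level context (which image belongs to which section) that distinguishes a page-level dataset from isolated image-caption pairs.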

Source: WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

Image Source: WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset