Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

2023-06-21 · NeurIPS 2023 · MMR total
Paper · PDF · Code (official)

Abstract

Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
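The abstract describes OBELICS as a corpus of interleaved image-text documents, i.e. documents in which images and text blocks appear in their original reading order. A common way to encode this is a pair of parallel lists where each position holds either an image reference or a text block, with the other slot set to None. The sketch below illustrates that representation in plain Python; the field names and helper are illustrative assumptions, not the official OBELICS schema.

```python
def interleave(doc):
    """Merge parallel image/text lists back into reading order.

    Assumes OBELICS-style parallel lists: at each position exactly one of
    doc["images"][i] / doc["texts"][i] is non-None.
    """
    assert len(doc["images"]) == len(doc["texts"])
    sequence = []
    for img, txt in zip(doc["images"], doc["texts"]):
        # Keep whichever slot is filled, tagged with its modality.
        sequence.append(("image", img) if img is not None else ("text", txt))
    return sequence

# A toy three-position document: text, then an image, then more text.
doc = {
    "images": [None, "https://example.com/fig1.jpg", None],
    "texts": ["Intro paragraph", None, "Caption discussion"],
}
print(interleave(doc))
```

Iterating documents this way preserves the image-text ordering that, per the abstract, distinguishes natural-document training data from plain image-text pairs.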

Results

Task       Dataset        Metric              Value  Model
MMR total  MMR-Benchmark  Total Column Score  139    Idefics-80B

Related Papers

MMR: Evaluating Reading Ability of Large Multimodal Models (2024-08-26)
Claude 3.5 Sonnet Model Card Addendum (2024-06-24)
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding (2024-06-14)
What matters when building vision-language models? (2024-05-03)
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (2024-04-22)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (2023-12-21)
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (2023-11-11)
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023-09-29)