VisCon-1M
Modalities: Images, Texts
Introduced: 2025-02-14
VisCon-100K is a dataset specially designed to facilitate fine-tuning of vision-language models (VLMs) by leveraging interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image-conversation samples. GPT-4V is used to generate image-contextual captions, while OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. This approach not only focuses on fine-grained visual content but also incorporates the accompanying web context, yielding superior fine-tuning performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.
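The two-stage annotation pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the prompt wording, function names, and the stubbed model calls are all assumptions; in practice stage 1 would call a vision-language model such as GPT-4V and stage 2 a text-only model such as OpenChat 3.5.

```python
# Hypothetical sketch of the VisCon two-stage annotation pipeline.
# Stage 1: contextual captioning (vision-language model, e.g. GPT-4V).
# Stage 2: converting captions into Q&A pairs (text LLM, e.g. OpenChat 3.5).
# Model calls are stubbed out; only the prompt construction is shown.

def build_caption_prompt(web_context: str) -> str:
    """Stage 1: prompt a captioner to describe the image using the
    surrounding web-document text as context (prompt wording assumed)."""
    return (
        "Describe the attached image in detail, using the surrounding "
        f"web-page text as context:\n\n{web_context}"
    )

def build_qa_prompt(caption: str) -> str:
    """Stage 2: prompt a text LLM to turn the contextual caption into
    free-form and multiple-choice Q&A pairs (prompt wording assumed)."""
    return (
        "From the following image caption, write one free-form question "
        "with its answer, and one multiple-choice question with four "
        f"options:\n\n{caption}"
    )

if __name__ == "__main__":
    # Example with placeholder context and caption.
    context = "An article about coral reef bleaching in the Pacific."
    caption = "A diver photographs a bleached coral reef at shallow depth."
    print(build_caption_prompt(context))
    print(build_qa_prompt(caption))
```

Each web document contributes its text as the stage-1 context, so the resulting captions (and the Q&A pairs derived from them) reflect both the image content and its surrounding page.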