InpaintCOCO

Modalities: Images, Texts · Licenses: diverse · Introduced: 2024-08-16

InpaintCOCO is a benchmark for assessing fine-grained concept understanding in multimodal (vision-language) models, similar to Winoground. To our knowledge, InpaintCOCO is the first benchmark consisting of image pairs with minimal differences, so that visual representations can be analyzed in a more standardized setting.

A data sample contains two images and two corresponding captions that differ in only one aspect: an object, the color of an object, or the size of an object.

The metric used in the paper checks whether each true image-text pair is more similar than the mismatched combinations, and requires this to hold for both pairs:

\begin{equation}
\begin{split}
\operatorname{sim}(i_\text{COCO}, t_\text{COCO}) > \operatorname{sim}(i_\text{inp}, t_\text{COCO})
\quad \land \quad
\operatorname{sim}(i_\text{inp}, t_\text{inp}) > \operatorname{sim}(i_\text{COCO}, t_\text{inp})
\end{split}
\label{eq:challengeset}
\end{equation}
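The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the similarity function (here plain cosine similarity over embedding vectors, e.g. from a CLIP-style model) and all names are assumptions.

```python
# Sketch of the InpaintCOCO pairwise metric. `cosine_sim` stands in for
# whatever image-text similarity the evaluated model provides; the
# function and variable names are illustrative assumptions.

def cosine_sim(a, b):
    # plain cosine similarity for two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def pair_correct(i_coco, i_inp, t_coco, t_inp, sim=cosine_sim):
    """True iff both inequalities of the metric hold:
    sim(i_COCO, t_COCO) > sim(i_inp, t_COCO)  and
    sim(i_inp, t_inp)   > sim(i_COCO, t_inp)."""
    return (sim(i_coco, t_coco) > sim(i_inp, t_coco)
            and sim(i_inp, t_inp) > sim(i_coco, t_inp))

def accuracy(samples, sim=cosine_sim):
    # fraction of samples for which both inequalities hold;
    # each sample is a tuple (i_coco, i_inp, t_coco, t_inp)
    correct = sum(pair_correct(*s, sim=sim) for s in samples)
    return correct / len(samples)
```

A sample only counts as correct when the model ranks the right caption higher for *both* images, which is what makes the setting stricter than ordinary retrieval accuracy.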

InpaintCOCO was published in the Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR) at ACL 2024.