
MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical Multimodal Large Language Models (MLLMs) from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct to medical experts. Our benchmark consists of 176 confusing pairs. A confusing pair is a set of two images that share the same question and answer options but have different correct answers. We evaluate models on their ability to answer the question correctly for both images in a confusing pair, a metric we call set accuracy. This metric indicates how well a model can tell the two images apart: a model that selects the same answer option for both images in every pair receives 0% set accuracy. We also report confusion, the proportion of confusing pairs for which the model chooses the same answer option for both images.
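
The two metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official evaluation code: the `PairResult` record, the function names, and the toy answers are all hypothetical, and it assumes each prediction and ground-truth answer is given as an option label such as "A" or "B".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairResult:
    """Hypothetical record for one confusing pair (two images, one question)."""
    pred_first: str   # option the model chose for the first image
    pred_second: str  # option the model chose for the second image
    gold_first: str   # correct option for the first image
    gold_second: str  # correct option for the second image


def set_accuracy(results: Sequence[PairResult]) -> float:
    """Fraction of pairs where the model is correct on BOTH images."""
    if not results:
        raise ValueError("results must contain at least one pair")
    both_correct = sum(
        r.pred_first == r.gold_first and r.pred_second == r.gold_second
        for r in results
    )
    return both_correct / len(results)


def confusion(results: Sequence[PairResult]) -> float:
    """Fraction of pairs where the model picked the same option for both images."""
    if not results:
        raise ValueError("results must contain at least one pair")
    same_choice = sum(r.pred_first == r.pred_second for r in results)
    return same_choice / len(results)


# Toy data (made up for illustration, not benchmark results).
example = [
    PairResult("A", "B", "A", "B"),  # both correct
    PairResult("A", "A", "A", "B"),  # same option twice: confused
    PairResult("B", "A", "A", "B"),  # different options, both wrong
    PairResult("B", "B", "B", "A"),  # same option twice: confused
]

print(set_accuracy(example))  # 0.25
print(confusion(example))     # 0.5
```

Because the two images in a confusing pair have different correct answers, any pair on which the model repeats the same option cannot count toward set accuracy, which is why a model that always repeats its answer scores 0% set accuracy and 100% confusion under this sketch.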