Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, Song-Chun Zhu

2021-10-25
Tasks: Question Answering, Mathematical Reasoning, Math Word Problem Solving, Object Recognition, Mathematical Question Answering, Arithmetic Reasoning, Visual Question Answering (VQA), Visual Question Answering

Abstract

Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images. However, aside from natural images, abstract diagrams with semantic richness are still understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA) with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions and three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To facilitate potential IconQA models to learn semantic representations for icon images, we further release an icon dataset Icon645 which contains 645,687 colored icons on 377 classes. We conduct extensive user studies and blind experiments and reproduce a wide range of advanced VQA methods to benchmark the IconQA task. Also, we develop a strong IconQA baseline Patch-TRM that applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.
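
The abstract names three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. As a minimal sketch of how predictions might be scored separately for each one, the snippet below computes exact-match accuracy per sub-task. The `Example` record, its field names, and the exact-match rule are assumptions made for illustration; they are not taken from the IconQA release or its official evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Sub-task names as given in the abstract.
SUB_TASKS = ("multi-image-choice", "multi-text-choice", "filling-in-the-blank")


@dataclass
class Example:
    """Hypothetical record: one question with its gold and predicted answer."""
    sub_task: str
    gold: str
    predicted: str


def accuracy_by_sub_task(examples):
    """Return exact-match accuracy (in percent) for each sub-task present."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        if ex.sub_task not in SUB_TASKS:
            raise ValueError(f"unknown sub-task: {ex.sub_task!r}")
        total[ex.sub_task] += 1
        # Assumed scoring rule: answers match after trimming whitespace.
        correct[ex.sub_task] += int(ex.predicted.strip() == ex.gold.strip())
    return {task: 100.0 * correct[task] / total[task] for task in total}


if __name__ == "__main__":
    demo = [
        Example("multi-text-choice", "3", "3"),
        Example("multi-text-choice", "3", "4"),
        Example("filling-in-the-blank", "12", " 12 "),
    ]
    print(accuracy_by_sub_task(demo))
```

Reporting each sub-task separately keeps a strong score on one answer format from hiding a weak score on another.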

Results

Task: Visual Question Answering (VQA). Dataset: IconQA.

Metric: Sub-tasks

| Model | Blank | Img. | Txt. |
|---|---|---|---|
| Patch-TRM | 83.62 | 82.66 | 75.19 |
| ViLT | 79.27 | 79.67 | 72.69 |
| ViT | 78.92 | 79.15 | 72.34 |
| UNITER | 78.53 | 78.71 | 72.39 |
| DFAF | 78.28 | 77.72 | 72.17 |
| MCAN | 74.52 | 77.36 | 71.25 |
| ViLBERT | 77.08 | 76.66 | 70.47 |
| BAN | 75.54 | 76.33 | 70.82 |
| Top-Down | 73.03 | 75.92 | 68.51 |
| Random | 0.29 | 41.7 | 36.87 |
| Q-Only | 28.45 | 41.64 | 36.86 |
| I-Only | 46.65 | 41.56 | 36.02 |

Metric: Reasoning

| Model | Alg. | Com. | Cou. | Est. | Fra. | Geo. | Mea. | Pat. | Pro. | Sce. | Sen. | Spa. | Tim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Patch-TRM | 56.73 | 87 | 77.81 | 98.24 | 82.13 | 81.87 | 97.98 | 68.75 | 95.73 | 62.39 | 92.49 | 55.62 | 77.98 |
| ViLT | 50.55 | 84.95 | 71.13 | 99.02 | 75.81 | 82.61 | 98.91 | 59.22 | 87.65 | 66.72 | 86.1 | 53.38 | 69.99 |
| ViT | 51.1 | 82.12 | 70.84 | 98.95 | 77.41 | 82.6 | 98.76 | 58.46 | 86.07 | 68.8 | 84.72 | 54.64 | 68.66 |
| UNITER | 49.18 | 83.67 | 71.01 | 99.41 | 78.37 | 81.31 | 99.38 | 60.81 | 87.84 | 61.25 | 86.1 | 48.34 | 69.77 |
| DFAF | 50.27 | 81.69 | 70.68 | 99.02 | 77.6 | 81.8 | 98.83 | 56.6 | 85.7 | 67.01 | 84.11 | 51.42 | 67.72 |
| MCAN | 47.32 | 82.73 | 68.94 | 99.08 | 76.2 | 79.86 | 98.99 | 54.79 | 84.87 | 62.49 | 83.25 | 49.7 | 68 |
| ViLBERT | 50.62 | 75.6 | 71.05 | 99.22 | 74.09 | 80.05 | 99.07 | 62.78 | 70.94 | 58.52 | 81.78 | 49.46 | 66.72 |
| BAN | 47.46 | 82.12 | 67.56 | 97.06 | 73.77 | 79.99 | 96.5 | 55.67 | 82.45 | 66.92 | 82.12 | 53.2 | 66.5 |
| Top-Down | 50 | 80.65 | 65.01 | 99.54 | 72.43 | 80.07 | 99.46 | 55.01 | 83.75 | 58.22 | 84.54 | 45.78 | 68.28 |
| Random | 11.12 | 41.2 | 18.38 | 3.62 | 34.84 | 30.3 | 0.36 | 34.81 | 38.81 | 34.25 | 45.16 | 36.49 | 35.82 |
| Q-Only | 28.02 | 48.19 | 33.63 | 40.46 | 33.06 | 38.03 | 38.07 | 33.66 | 40.76 | 35.37 | 45.25 | 37.14 | 48.09 |
| I-Only | 31.73 | 45.26 | 37.64 | 62.29 | 32.48 | 38.71 | 64.02 | 36.29 | 37.51 | 35.47 | 45.25 | 37.52 | 47.37 |
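
One way to read the results listed above is to ask which model leads on each sub-task and by how much it beats a blind baseline. The sketch below does this for a hand-copied subset of the sub-task scores. The dictionary layout and the helper names `best_model` and `gain_over` are illustrative assumptions, not part of any Papers With Code or IconQA tooling.

```python
# Sub-task scores copied from the results listed above.
# Only a subset of the twelve listed models is included here.
SUBTASK_SCORES = {
    "Patch-TRM": {"Blank": 83.62, "Img.": 82.66, "Txt.": 75.19},
    "ViLT": {"Blank": 79.27, "Img.": 79.67, "Txt.": 72.69},
    "Top-Down": {"Blank": 73.03, "Img.": 75.92, "Txt.": 68.51},
    "Q-Only": {"Blank": 28.45, "Img.": 41.64, "Txt.": 36.86},
    "Random": {"Blank": 0.29, "Img.": 41.7, "Txt.": 36.87},
}


def best_model(scores, metric):
    """Return (model, value) for the highest value on one metric."""
    model = max(scores, key=lambda name: scores[name][metric])
    return model, scores[model][metric]


def gain_over(scores, model, baseline, metric):
    """Difference in points between a model and a baseline on one metric."""
    return round(scores[model][metric] - scores[baseline][metric], 2)


if __name__ == "__main__":
    for metric in ("Blank", "Img.", "Txt."):
        name, value = best_model(SUBTASK_SCORES, metric)
        gain = gain_over(SUBTASK_SCORES, name, "Q-Only", metric)
        print(f"{metric}: best = {name} ({value}), gain over Q-Only = {gain}")
```

Comparing against the Q-Only row rather than Random is a design choice: it shows how much of a score depends on actually using the image.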

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)