Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World

Rujie Wu, Xiaojian Ma, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, Yizhou Wang

2023-10-16 · Few-Shot Learning · Logical Reasoning · Form · Visual Reasoning

Paper · PDF · Code (official)

Abstract

We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concept that is exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already poses a significant challenge to current few-shot reasoning algorithms. We further investigate to what extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs and by combining VLMs and LLMs in an interactive reasoning scheme. We also conceived a neuro-symbolic reasoning approach that reconciles LLMs and VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manages to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
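The task format described above (two support sets, held-out query images, 2-class accuracy) can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluation code; the `BongardProblem` structure and `solver` interface are hypothetical names chosen for clarity, and any concrete few-shot reasoner (VLM probe, VLM + LLM pipeline, etc.) would plug in as the `solver` callable.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BongardProblem:
    """One hypothetical Bongard-style instance: the hidden concept is
    exclusively depicted by the positive images."""
    positives: List[str]  # images depicting the hidden concept
    negatives: List[str]  # images that do not depict it
    queries: List[str]    # held-out query images to classify
    labels: List[int]     # 1 = belongs to the positive set, 0 = negative

# A solver sees both support sets plus one query image and predicts 0 or 1.
Solver = Callable[[List[str], List[str], str], int]

def two_class_accuracy(problems: List[BongardProblem], solver: Solver) -> float:
    """Percentage of query images assigned to the correct set."""
    correct = total = 0
    for p in problems:
        for query, label in zip(p.queries, p.labels):
            correct += int(solver(p.positives, p.negatives, query) == label)
            total += 1
    return 100.0 * correct / total
```

Under this metric, a solver that always predicts "positive" scores 50% on a balanced query set, which is why the reported 91% (human) vs. 64% (best learner) gap is meaningful.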

Results

Task             | Dataset           | Metric           | Value | Model
-----------------|-------------------|------------------|-------|------------------------------------------
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 91    | Human
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 64    | SNAIL
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 63.8  | InstructBLIP + GPT-4
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 63.3  | BLIP-2 + ChatGPT (Fine-tuned)
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 55.5  | InstructBLIP + ChatGPT + Neuro-Symbolic
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 49.3  | ChatCaptioner + ChatGPT
Visual Reasoning | Bongard-OpenWorld | 2-Class Accuracy | 49.3  | Otter

Related Papers

GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation (2025-07-11)
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection (2025-07-10)
An Enhanced Privacy-preserving Federated Few-shot Learning Framework for Respiratory Disease Diagnosis (2025-07-10)
PyVision: Agentic Vision with Dynamic Tooling (2025-07-10)
Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning (2025-07-09)