
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Published 2023-12-05 · CVPR 2024
Tasks: Large Language Model, Object Counting, Visual Question Answering (VQA), Meme Classification, Language Modelling

Abstract

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and retrieving prior knowledge. Recent work shows promise by using a large language model (LLM) to decompose such tasks into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which is then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD also helps adaptation to real-world applications with limited data.
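The pipeline the abstract describes (sample candidate programs with an LLM, execute them against specialized vision tools, keep only programs whose answer verifies against a label, then rewrite the execution trace into a reasoning description for instruction tuning) can be summarized in a short sketch. The following is a minimal, hypothetical Python outline, not the authors' code: `sample_programs`, `execute_program`, and `rewrite_trace` are assumed stand-ins for the LLM program generator, the tool-augmented executor, and the LLM rewriting step.

```python
"""Minimal sketch of the VPD data-generation loop described in the
abstract. All helper functions below are hypothetical stand-ins;
they are not the paper's actual API."""

from dataclasses import dataclass


@dataclass
class Trace:
    question: str
    rationale: str  # natural-language description of the reasoning steps
    answer: str


def sample_programs(question: str, n: int = 5) -> list[str]:
    # Hypothetical: prompt an LLM for n candidate visual programs that
    # call specialized vision models (detection, VQA, knowledge lookup).
    raise NotImplementedError


def execute_program(program: str, image) -> tuple[str, list[str]]:
    # Hypothetical: run the program with tool models; return the final
    # answer plus a step-by-step log. May raise if the program is broken.
    raise NotImplementedError


def rewrite_trace(question: str, step_log: list[str], answer: str) -> str:
    # Hypothetical: ask an LLM to turn the executed program's step log
    # into a fluent chain-of-thought string for instruction tuning.
    raise NotImplementedError


def distill_example(image, question: str, label: str) -> Trace | None:
    """Return a (question, rationale, answer) training triple, or None
    if no sampled program reproduces the ground-truth label."""
    for program in sample_programs(question):
        try:
            answer, log = execute_program(program, image)
        except Exception:
            continue  # generated programs are error-prone; skip failures
        if answer == label:  # verification step: keep only correct programs
            return Trace(question, rewrite_trace(question, log, answer), answer)
    return None
```

Under this reading of the abstract, the verified triples collected by `distill_example` form the instruction-tuning data that lets the resulting VLM answer such questions in a single forward pass, without loading the tool models at inference time.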

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | A-OKVQA | DA VQA Score | 68.2 | PaLI-X-VPD |
| Visual Question Answering (VQA) | A-OKVQA | MC Accuracy | 80.4 | PaLI-X-VPD |
| Visual Question Answering (VQA) | OK-VQA | Accuracy | 66.8 | PaLI-X-VPD |
| Visual Question Answering (VQA) | GQA test-dev | Accuracy | 67.3 | PaLI-X-VPD |
| Object Counting | TallyQA-Complex | Accuracy | 76.6 | PaLI-X-VPD |
| Object Counting | TallyQA-Simple | Accuracy | 86.2 | PaLI-X-VPD |
| Meme Classification | Hateful Memes | ROC-AUC | 0.892 | PaLI-X-VPD |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
- Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)