TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming...

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin

2024-11-25AI AgentVisual Question Answering (VQA)Visual Question Answering
PaperPDFCode(official)

Abstract

An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models~(MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each children node representing a zoomed sub-patch of the parent node and the root represents the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series base MLLMs with large margin~(e.g., LLaVA-v1.5-7B increases by 34.57\% on $V^*$ Bench and 17.88\% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at \href{https://github.com/om-ai-lab/ZoomEye}{https://github.com/om-ai-lab/ZoomEye}.

Results

TaskDatasetMetricValueModel
Visual Question Answering (VQA)V*benchAccuracy90.58LLaVA-OneVision7B w. ZoomEye
Visual Question AnsweringV*benchAccuracy90.58LLaVA-OneVision7B w. ZoomEye

Related Papers

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM2025-07-16Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI2025-07-13Evaluating Attribute Confusion in Fashion Text-to-Image Generation2025-07-09LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation2025-07-09Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians' Insights2025-07-09MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning2025-07-09