
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Wentao Mo, Yang Liu

2024-02-24 · Question Answering · 3D Question Answering (3D-QA) · Visual Question Answering (VQA)
Paper · PDF · Code (official)

Abstract

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and the limited diversity of visual content hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach uses a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines the 2D and 3D modalities and captures fine-grained correlations between them, allowing the two to mutually augment each other. Integrating the mechanisms above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D VQA. Experiments validate that BridgeQA achieves state-of-the-art results on 3D VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
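The abstract names two mechanisms: question-conditioned selection of 2D views, and a two-branch ("Twin") Transformer in which 2D and 3D token streams cross-attend to each other. Below is a minimal PyTorch sketch of both ideas. All names (`select_views`, `TwinFusionBlock`), shapes, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation; see the official repository for the real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_views(question_emb, view_embs, k=1):
    """Question-conditioned view selection (assumed mechanism): rank
    candidate 2D view embeddings by cosine similarity to the question
    embedding and keep the top-k most relevant views."""
    q = F.normalize(question_emb, dim=-1)       # (d,)
    v = F.normalize(view_embs, dim=-1)          # (num_views, d)
    scores = v @ q                              # (num_views,)
    return view_embs[scores.topk(k).indices]    # (k, d)

class TwinFusionBlock(nn.Module):
    """One hypothetical two-branch fusion block: each modality
    self-attends, then queries the other via cross-attention, so the
    2D and 3D token streams can augment each other."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_2d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_3d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_2d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_3d = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tok2d, tok3d):
        # Intra-modal self-attention, with residuals and layer norm.
        tok2d = self.norms[0](tok2d + self.self_2d(tok2d, tok2d, tok2d)[0])
        tok3d = self.norms[1](tok3d + self.self_3d(tok3d, tok3d, tok3d)[0])
        # Cross-modal attention: each branch attends to the other.
        fused2d = self.norms[2](tok2d + self.cross_2d(tok2d, tok3d, tok3d)[0])
        fused3d = self.norms[3](tok3d + self.cross_3d(tok3d, tok2d, tok2d)[0])
        return fused2d, fused3d

# Toy shapes: 49 patch tokens from a selected 2D view, 128 3D tokens.
block = TwinFusionBlock()
f2d, f3d = block(torch.randn(1, 49, 256), torch.randn(1, 128, 256))
```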

Results

Task                             Dataset                  Metric       Value  Model
Visual Question Answering (VQA)  ScanQA Test w/ objects   BLEU-1       34.49  BridgeQA
Visual Question Answering (VQA)  ScanQA Test w/ objects   BLEU-4       24.06  BridgeQA
Visual Question Answering (VQA)  ScanQA Test w/ objects   CIDEr        83.75  BridgeQA
Visual Question Answering (VQA)  ScanQA Test w/ objects   Exact Match  31.29  BridgeQA
Visual Question Answering (VQA)  ScanQA Test w/ objects   METEOR       16.51  BridgeQA
Visual Question Answering (VQA)  ScanQA Test w/ objects   ROUGE        43.26  BridgeQA
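For orientation, BLEU-1 and Exact Match above compare predicted answer strings against ground-truth answers (values reported as percentages). A hedged sketch of those two metrics using NLTK follows; the official ScanQA evaluation averages over multiple reference answers per question and uses a captioning toolkit for CIDEr/METEOR/ROUGE, so this is illustrative only.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu1(reference: str, candidate: str) -> float:
    """Unigram BLEU between one reference answer and a prediction."""
    return sentence_bleu(
        [reference.lower().split()], candidate.lower().split(),
        weights=(1.0, 0, 0, 0),
        smoothing_function=SmoothingFunction().method1,
    )

def exact_match(reference: str, candidate: str) -> float:
    """1.0 if the normalized answer strings are identical, else 0.0."""
    return float(reference.strip().lower() == candidate.strip().lower())

print(bleu1("the brown chair", "the chair"))      # partial overlap < 1.0
print(exact_match("brown chair", "brown chair"))  # 1.0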

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)