Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

2020-12-03 | CVPR 2021
Tasks: 3D dense captioning, Descriptive, 3D Question Answering (3D-QA), object-detection, Dense Captioning, 3D Object Detection, Object Detection

Abstract

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CIDEr@0.5IoU improvement).
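
The CIDEr@0.5IoU figure in the abstract couples caption quality with localization quality: a caption only counts if its predicted box overlaps the ground-truth box enough. The sketch below illustrates one way such a thresholded metric could be computed. It is not the authors' evaluation code. The box layout (axis-aligned, min/max corners), the function names, the per-object caption scores, and the `>=` comparison at the threshold are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; inverted or empty extents give 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    overlap: Box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(overlap)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def metric_at_iou(
    scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    k: float = 0.5,
) -> float:
    """Mean caption score where objects with box IoU below k contribute 0.

    `scores` are assumed per-object caption scores (e.g. from a captioning
    metric such as CIDEr); how they are produced is outside this sketch.
    """
    if not (len(scores) == len(pred_boxes) == len(gt_boxes)):
        raise ValueError("scores and boxes must have the same length")
    if not scores:
        return 0.0
    kept = [
        s if iou_3d(p, g) >= k else 0.0
        for s, p, g in zip(scores, pred_boxes, gt_boxes)
    ]
    return sum(kept) / len(scores)
```

Under this scheme a well-worded caption attached to a poorly localized box scores zero, so the metric rewards detection and description jointly rather than either one alone.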

Results

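| Task                            | Dataset           | Metric      | Value | Model    |
|---------------------------------|-------------------|-------------|-------|----------|
| Visual Question Answering (VQA) | SQA3D             | Exact Match | 41    | Scan2Cap |
| Image Captioning                | ScanRefer Dataset | BLEU-4      | 34.25 | Scan2Cap |
| Image Captioning                | ScanRefer Dataset | CIDEr       | 53.73 | Scan2Cap |
| Image Captioning                | ScanRefer Dataset | METEOR      | 26.14 | Scan2Cap |
| Image Captioning                | ScanRefer Dataset | ROUGE-L     | 54.95 | Scan2Cap |
| Image Captioning                | Nr3D              | BLEU-4      | 17.24 | Scan2Cap |
| Image Captioning                | Nr3D              | CIDEr       | 27.47 | Scan2Cap |
| Image Captioning                | Nr3D              | METEOR      | 21.8  | Scan2Cap |
| Image Captioning                | Nr3D              | ROUGE-L     | 49.06 | Scan2Cap |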
