Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

2020-12-03, CVPR 2021
Tasks: 3D Dense Captioning, Descriptive, 3D Question Answering (3D-QA), Dense Captioning, 3D Object Detection, Object Detection


Abstract

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e., relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CiDEr@0.5IoU improvement).
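The "CiDEr@0.5IoU" figure in the abstract combines caption quality with localization quality. The following is a minimal sketch of one way such an IoU-gated metric could be computed, not the authors' evaluation code. It makes several assumptions: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, each predicted box is already paired with its ground-truth box, a per-object caption score (for example a CIDEr value) is supplied by the caller, and the average is taken over all paired objects. The names `box_iou_3d` and `metric_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def metric_at_iou(
    caption_scores: Sequence[float],
    predicted_boxes: Sequence[Box],
    ground_truth_boxes: Sequence[Box],
    iou_threshold: float = 0.5,
) -> float:
    """Average caption score, counting an object only when its predicted
    box overlaps the paired ground-truth box with IoU >= iou_threshold.
    Objects below the threshold contribute zero (assumed convention)."""
    if not caption_scores:
        return 0.0
    total = 0.0
    for score, predicted, truth in zip(
        caption_scores, predicted_boxes, ground_truth_boxes
    ):
        if box_iou_3d(predicted, truth) >= iou_threshold:
            total += score
    return total / len(caption_scores)


if __name__ == "__main__":
    cube = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    shifted = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
    # The first object is perfectly localized; the second overlaps too little.
    print(metric_at_iou([1.0, 0.5], [cube, shifted], [cube, cube]))
```

Under this gating, a well-worded caption attached to a poorly localized box scores nothing, so the metric rewards detection and description jointly rather than either one alone.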

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	SQA3D	Exact Match	41	Scan2Cap
Image Captioning	ScanRefer Dataset	BLEU-4	34.25	Scan2Cap
Image Captioning	ScanRefer Dataset	CIDEr	53.73	Scan2Cap
Image Captioning	ScanRefer Dataset	METEOR	26.14	Scan2Cap
Image Captioning	ScanRefer Dataset	ROUGE-L	54.95	Scan2Cap
Image Captioning	Nr3D	BLEU-4	17.24	Scan2Cap
Image Captioning	Nr3D	CIDEr	27.47	Scan2Cap
Image Captioning	Nr3D	METEOR	21.8	Scan2Cap
Image Captioning	Nr3D	ROUGE-L	49.06	Scan2Cap
