Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai

2022-04-22

Machine Translation, 3D dense captioning, Caption Generation, Scene Understanding, object-detection, Dense Captioning, 3D Object Detection, Object Detection


Abstract

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label: a natural language description of the visual appearance and spatial relations of each scene object of interest. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions. In particular, we investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective, together with an object-centric decoder, for precise and spatiality-enhanced object caption generation. Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively. Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/.
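The CIDEr@0.5IoU figure quoted in the abstract combines caption quality with detection quality. The following is a minimal sketch of how such a thresholded metric is commonly computed, not the authors' evaluation code. It assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), assumes the per-object caption scores (for example CIDEr) are already computed elsewhere, and assumes a caption counts when its box IoU is at least the threshold. The names `box_iou_3d` and `score_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed layout.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    return inter / (box_volume(a) + box_volume(b) - inter)


def score_at_iou(
    caption_scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    k: float = 0.5,
) -> float:
    """Average caption score, zeroing captions whose box IoU is below k.

    Whether the comparison is strict and which object count is used as the
    denominator vary between evaluation protocols; both are assumptions here.
    """
    if not caption_scores:
        return 0.0
    total = 0.0
    for score, pred, gt in zip(caption_scores, pred_boxes, gt_boxes):
        if box_iou_3d(pred, gt) >= k:
            total += score
    return total / len(caption_scores)
```

Under this definition a well-worded caption attached to a poorly localized box contributes nothing, so the metric rewards detection and description jointly.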

Results

Task	Dataset	Metric	Value	Model
Image Captioning	ScanRefer Dataset	BLEU-4	35.3	SpaCap3D
Image Captioning	ScanRefer Dataset	CIDEr	58.06	SpaCap3D
Image Captioning	ScanRefer Dataset	METEOR	26.16	SpaCap3D
Image Captioning	ScanRefer Dataset	ROUGE-L	55.03	SpaCap3D
Image Captioning	Nr3D	BLEU-4	19.92	SpaCap3D
Image Captioning	Nr3D	CIDEr	33.71	SpaCap3D
Image Captioning	Nr3D	METEOR	22.61	SpaCap3D
Image Captioning	Nr3D	ROUGE-L	50.5	SpaCap3D
