Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai

2022-04-22
Machine Translation, 3D dense captioning, Caption Generation, Scene Understanding, object-detection, Dense Captioning, 3D Object Detection, Object Detection

Abstract

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural language description on visual appearance and spatial relations for each scene object of interest. To detect and describe objects in a scene, following the spirit of neural machine translation, we propose a transformer-based encoder-decoder architecture, namely SpaCap3D, to transform objects into descriptions, where we especially investigate the relative spatiality of objects in 3D scenes and design a spatiality-guided encoder via a token-to-token spatial relation learning objective and an object-centric decoder for precise and spatiality-enhanced object caption generation. Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively. Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/ .
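The CIDEr@0.5IoU figure in the abstract combines caption quality with detection quality. The sketch below illustrates one common reading of such "metric@kIoU" scores, following the convention used with the Scan2Cap baseline: a caption's score counts only when its predicted box overlaps the ground-truth box by at least k, and the sum is averaged over all ground-truth objects. It is not the authors' evaluation code. The function names, the axis-aligned box format, the use of precomputed per-object caption scores, and the `>=` comparison at the threshold are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box (0 if the box is degenerate)."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        intersection *= hi - lo
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def metric_at_iou(
    caption_scores: Sequence[float],
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Average caption score, zeroing objects whose box IoU is below threshold.

    caption_scores[i] is an already-computed caption metric (e.g. a CIDEr
    value) for object i; computing that metric is outside this sketch.
    """
    if not (len(caption_scores) == len(pred_boxes) == len(gt_boxes)):
        raise ValueError("inputs must have the same length")
    if not caption_scores:
        return 0.0
    kept = sum(
        score
        for score, pred, gt in zip(caption_scores, pred_boxes, gt_boxes)
        if iou_3d(pred, gt) >= threshold  # assumption: inclusive threshold
    )
    # Divide by all objects, so poorly localized objects lower the score.
    return kept / len(caption_scores)
```

Under this reading, a model cannot score well by describing objects it fails to localize, which is why the metric is reported at a fixed IoU threshold.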

Results

Task              Dataset            Metric   Value  Model
Image Captioning  ScanRefer Dataset  BLEU-4   35.3   SpaCap3d
Image Captioning  ScanRefer Dataset  CIDEr    58.06  SpaCap3d
Image Captioning  ScanRefer Dataset  METEOR   26.16  SpaCap3d
Image Captioning  ScanRefer Dataset  ROUGE-L  55.03  SpaCap3d
Image Captioning  Nr3D               BLEU-4   19.92  SpaCap3d
Image Captioning  Nr3D               CIDEr    33.71  SpaCap3d
Image Captioning  Nr3D               METEOR   22.61  SpaCap3d
Image Captioning  Nr3D               ROUGE-L  50.5   SpaCap3d

Related Papers

Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)