Data sourced from the PWC Archive (CC-BY-SA 4.0).


Contextual Modeling for 3D Dense Captioning on Point Clouds

Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma

2022-10-08 · Tasks: 3D Dense Captioning, Dense Captioning

Abstract

3D dense captioning, an emerging vision-language task, aims to identify and locate each object in a set of point clouds and to generate a distinctive natural language sentence describing each located object. However, existing methods mainly focus on mining inter-object relationships while ignoring contextual information, especially the non-object details and background environment within the point clouds, which leads to low-quality descriptions such as inaccurate relative position information. In this paper, we make the first attempt to utilize point cloud clustering features as contextual information that supplies the non-object details and background environment of the point clouds, and we incorporate them into the 3D dense captioning task. We propose two separate modules, Global Context Modeling (GCM) and Local Context Modeling (LCM), which perform contextual modeling of the point clouds in a coarse-to-fine manner. Specifically, the GCM module captures the inter-object relationships among all objects together with global contextual information to obtain more complete scene information for the whole point cloud. The LCM module exploits the influence of the target object's neighboring objects and local contextual information to enrich the object representations. With these global and local contextual modeling strategies, the proposed model can effectively characterize both the object representations and the contextual information, and thereby generate comprehensive and detailed descriptions of the located objects. Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that the proposed method achieves new state-of-the-art results on the 3D dense captioning task and verify the effectiveness of the proposed contextual modeling of point clouds.
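The abstract describes the two modules only at a high level. The sketch below is a minimal, hypothetical PyTorch rendering of the coarse-to-fine idea, not the authors' implementation: the class names, the use of multi-head attention, and the k-nearest-neighbor selection of local context are all assumptions made for illustration.

```python
# Hypothetical sketch (not the paper's code): GCM lets object proposals
# attend to scene-wide cluster features; LCM restricts attention to each
# object's k nearest neighboring objects.
import torch
import torch.nn as nn


class GlobalContextModeling(nn.Module):
    """Assumed GCM: every object attends to all point-cloud cluster
    features, injecting scene-level (background / non-object) context."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, objects, clusters):
        # objects:  (B, N_obj, d) detected-object features
        # clusters: (B, N_clu, d) clustering features of the whole cloud
        ctx, _ = self.attn(objects, clusters, clusters)
        return self.norm(objects + ctx)


class LocalContextModeling(nn.Module):
    """Assumed LCM: each object attends only to its k nearest neighboring
    objects, refining the representation with local context."""

    def __init__(self, d_model=256, n_heads=4, k=5):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, objects, centers):
        # objects: (B, N_obj, d); centers: (B, N_obj, 3) proposal centers
        B, N, d = objects.shape
        dist = torch.cdist(centers, centers)                          # (B, N, N)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]   # drop self
        batch = torch.arange(B, device=objects.device)[:, None, None]
        neighbors = objects[batch, idx]                               # (B, N, k, d)
        q = objects.reshape(B * N, 1, d)
        kv = neighbors.reshape(B * N, self.k, d)
        ctx, _ = self.attn(q, kv, kv)
        return self.norm(objects + ctx.reshape(B, N, d))


# Coarse-to-fine: global context first, then local refinement.
gcm, lcm = GlobalContextModeling(), LocalContextModeling()
obj = torch.randn(2, 32, 256)      # 32 object proposals per scene
clu = torch.randn(2, 128, 256)     # 128 point-cloud cluster features
ctr = torch.randn(2, 32, 3)        # proposal centers
refined = lcm(gcm(obj, clu), ctr)  # (2, 32, 256) context-aware features
```

In this reading, GCM gives every proposal a view of scene-wide cluster features (background and non-object detail), while LCM narrows attention to spatial neighbors, mirroring the coarse-to-fine order the abstract describes.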

Results

Task                 Dataset     Metric    Value   Model
3D Dense Captioning  ScanRefer   BLEU-4    26.64   Contextual
3D Dense Captioning  ScanRefer   CIDEr     50.29   Contextual
3D Dense Captioning  ScanRefer   METEOR    22.57   Contextual
3D Dense Captioning  ScanRefer   ROUGE-L   44.71   Contextual
3D Dense Captioning  Nr3D        BLEU-4    20.42   Contextual
3D Dense Captioning  Nr3D        CIDEr     35.26   Contextual
3D Dense Captioning  Nr3D        METEOR    22.77   Contextual
3D Dense Captioning  Nr3D        ROUGE-L   50.78   Contextual
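BLEU-4, CIDEr, METEOR, and ROUGE-L are standard COCO-style captioning metrics. A common way to compute them is sketched below with the pycocoevalcap package and made-up captions; note that 3D dense captioning benchmarks typically further gate these scores by whether the predicted box matches a ground-truth object at a given IoU threshold.

```python
# Illustrative COCO-style caption scoring with pycocoevalcap; the scene and
# object IDs and captions here are invented for the example.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer

# object id -> list of reference / generated captions
gts = {"scene0011_00|chair_3": [{"caption": "a brown chair next to the table"}]}
res = {"scene0011_00|chair_3": [{"caption": "a chair beside a wooden table"}]}

tok = PTBTokenizer()
gts, res = tok.tokenize(gts), tok.tokenize(res)

for name, scorer in [("BLEU-4", Bleu(4)), ("CIDEr", Cider()),
                     ("METEOR", Meteor()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(gts, res)
    # Bleu returns a list [BLEU-1 .. BLEU-4]; the others return a scalar.
    score = score[-1] if isinstance(score, list) else score
    print(f"{name}: {score:.4f}")
```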

Related Papers

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving (2025-06-06)
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs (2025-06-05)
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action (2025-05-02)
3D CoCa: Contrastive Learners are 3D Captioners (2025-04-13)
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation (2024-12-09)
PerLA: Perceptive 3D Language Assistant (2024-11-29)
3D Scene Graph Guided Vision-Language Pre-training (2024-11-27)
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation (2024-11-26)