Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


End-to-End 3D Dense Captioning with Vote2Cap-DETR

Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu

2023-01-06 · CVPR 2023 · 3D Dense Captioning · Dense Captioning
Paper · PDF · Code (official)

Abstract

3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered object spatial and class distributions across different scenes. In this paper, we propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular DEtection TRansformer (DETR). Compared with prior art, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote-query-driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method performs detection and captioning in one stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Code will be released soon.
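The one-stage idea in the abstract (a shared transformer encoder, learnable vote queries, and parallel box/class/caption heads over the same query features) can be illustrated with a minimal PyTorch sketch. All names, dimensions, and the single-token caption head are illustrative assumptions for shape-level intuition, not the authors' implementation; in the paper the vote queries are derived from predicted spatial offsets and the caption decoder generates full sentences in a set-prediction manner.

```python
# Hedged sketch of a Vote2Cap-DETR-style one-stage pipeline.
# Assumptions: seed point features are precomputed; vote queries are
# simplified to a learned embedding; the caption head emits one token's
# logits per query instead of a full autoregressive caption decoder.
import torch
import torch.nn as nn

class Vote2CapSketch(nn.Module):
    def __init__(self, d_model=256, n_queries=256, n_classes=18, vocab=3000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Simplified stand-in for vote queries (learnable, shared across scenes).
        self.vote_query = nn.Embedding(n_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Parallel heads over the same decoded query features: one stage,
        # no separate "describe" network on top of detections.
        self.box_head = nn.Linear(d_model, 6)            # center + size
        self.cls_head = nn.Linear(d_model, n_classes + 1)  # classes + "no object"
        self.caption_head = nn.Linear(d_model, vocab)    # per-query token logits

    def forward(self, scene_tokens):
        batch = scene_tokens.size(0)
        memory = self.encoder(scene_tokens)
        queries = self.vote_query.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(queries, memory)
        return self.box_head(decoded), self.cls_head(decoded), self.caption_head(decoded)

model = Vote2CapSketch()
scene = torch.randn(2, 1024, 256)  # batch of 2 scenes, 1024 seed features each
boxes, logits, captions = model(scene)
```

Each of the 256 queries yields a box, a class distribution, and caption logits in a single forward pass, which is the structural point of the one-stage design.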

Results

Task | Dataset | Metric | Value | Model
3D Dense Captioning | ScanRefer | BLEU-4@0.5IoU | 39.34 | Vote2Cap-DETR
3D Dense Captioning | ScanRefer | CIDEr@0.5IoU | 71.45 | Vote2Cap-DETR
3D Dense Captioning | ScanRefer | METEOR@0.5IoU | 28.25 | Vote2Cap-DETR
3D Dense Captioning | ScanRefer | ROUGE-L@0.5IoU | 59.33 | Vote2Cap-DETR
3D Dense Captioning | Nr3D | BLEU-4@0.5IoU | 26.68 | Vote2Cap-DETR
3D Dense Captioning | Nr3D | CIDEr@0.5IoU | 43.84 | Vote2Cap-DETR
3D Dense Captioning | Nr3D | METEOR@0.5IoU | 25.41 | Vote2Cap-DETR
3D Dense Captioning | Nr3D | ROUGE-L@0.5IoU | 54.43 | Vote2Cap-DETR

Related Papers

STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving (2025-06-06)
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs (2025-06-05)
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action (2025-05-02)
3D CoCa: Contrastive Learners are 3D Captioners (2025-04-13)
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation (2024-12-09)
PerLA: Perceptive 3D Language Assistant (2024-11-29)
3D Scene Graph Guided Vision-Language Pre-training (2024-11-27)
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation (2024-11-26)