Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

Zhu Yu, Runmin Zhang, Jiacheng Ying, Junchen Yu, Xiaohai Hu, Lun Luo, Si-Yuan Cao, Hui-Liang Shen

2024-05-22 · 3D Semantic Scene Completion from a single RGB image
Paper · PDF · Code (official)

Abstract

Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared, context-independent queries across different input images, which fail to capture distinctions among them as the focal regions of different inputs vary, and may result in undirected feature aggregation in cross-attention. Additionally, the absence of depth information may lead to points projected onto the image plane sharing the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extends deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. Simultaneously, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining a mIoU of 16.87 and 20.05, as well as an IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches employing temporal images as inputs or much larger image backbone networks.
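The depth ambiguity the abstract describes — two 3D points that project to the same image pixel becoming indistinguishable when features are sampled from a 2D map — can be illustrated with a toy NumPy sketch. The shapes, names, and depth-binned feature volume here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

np.random.seed(0)

# Illustrative shapes: image height/width, depth bins, channels.
H, W, D, C = 4, 4, 8, 16
feat_2d = np.random.rand(H, W, C)     # ordinary 2D feature map
feat_3d = np.random.rand(H, W, D, C)  # hypothetical depth-extended feature volume

# Two 3D points that land on the same pixel but at different depths.
p1 = (2, 3, 1)  # (row, col, depth bin)
p2 = (2, 3, 6)  # same pixel, different depth

# 2D sampling: both points collapse onto the same pixel feature,
# so they are indistinguishable -> depth ambiguity.
same = np.allclose(feat_2d[p1[0], p1[1]], feat_2d[p2[0], p2[1]])

# Sampling in a 3D (pixel + depth) space keeps them apart,
# which is the intuition behind extending deformable
# cross-attention from 2D to 3D pixel space.
distinct = not np.allclose(feat_3d[p1], feat_3d[p2])

print(same, distinct)
```

Running this prints `True True`: the 2D lookup cannot separate the two points, while the depth coordinate in the 3D lookup does.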

Results

Task | Dataset | Metric | Value | Model
Reconstruction | KITTI-360 | mIoU | 20.05 | CGFormer
Reconstruction | SemanticKITTI | mIoU | 16.63 | CGFormer
3D Reconstruction | KITTI-360 | mIoU | 20.05 | CGFormer
3D Reconstruction | SemanticKITTI | mIoU | 16.63 | CGFormer
3D | KITTI-360 | mIoU | 20.05 | CGFormer
3D | SemanticKITTI | mIoU | 16.63 | CGFormer
3D Semantic Scene Completion | KITTI-360 | mIoU | 20.05 | CGFormer
3D Semantic Scene Completion | SemanticKITTI | mIoU | 16.63 | CGFormer
3D Scene Reconstruction | KITTI-360 | mIoU | 20.05 | CGFormer
3D Scene Reconstruction | SemanticKITTI | mIoU | 16.63 | CGFormer
Single-View 3D Reconstruction | KITTI-360 | mIoU | 20.05 | CGFormer
Single-View 3D Reconstruction | SemanticKITTI | mIoU | 16.63 | CGFormer
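The mIoU metric reported above is the per-class Intersection-over-Union averaged across semantic classes. A minimal sketch of how it is typically computed over voxel label arrays (the function name and class-skipping convention are assumptions, not the benchmark's exact evaluation code):

```python
import numpy as np

def miou(pred, target, num_classes):
    """Mean Intersection-over-Union over semantic classes.

    pred, target: integer arrays of class labels, same shape.
    Classes absent from both prediction and ground truth are skipped,
    so they neither help nor hurt the average.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny usage example: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
print(miou(pred, target, 2))  # (1/2 + 2/3) / 2 ≈ 0.5833
```

Benchmark suites usually accumulate the intersection and union counts over all scenes in a confusion matrix before taking the ratio, rather than averaging per-scene IoUs.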

Related Papers

Monocular Occupancy Prediction for Scalable Indoor Scenes (2024-07-16)
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space (2023-09-26)
Symphonize 3D Semantic Scene Completion with Contextual Instance Queries (2023-06-27)
OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction (2023-04-11)
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion (2023-02-23)
MonoScene: Monocular 3D Semantic Scene Completion (2021-12-01)
Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion (2020-12-07)
LMSCNet: Lightweight Multiscale 3D Semantic Completion (2020-08-24)