
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Spatio-temporal Self-Supervised Representation Learning for 3D Point Clouds

Siyuan Huang, Yichen Xie, Song-Chun Zhu, Yixin Zhu

2021-09-01 · ICCV 2021 10

3D Point Cloud Linear Classification, Representation Learning, Data Augmentation, Scene Understanding, Semantic Segmentation, 3D Semantic Segmentation, object-detection, 3D Shape Classification, 3D Point Cloud Classification, 3D Object Detection, Object Detection

Abstract

To date, various 3D scene understanding tasks still lack practical and generalizable pre-trained models, primarily due to the intricate nature of these tasks and the immense variations introduced by camera views, lighting, occlusions, etc. In this paper, we tackle this challenge by introducing a spatio-temporal representation learning (STRL) framework, capable of learning from unlabeled 3D point clouds in a self-supervised fashion. Inspired by how infants learn from visual data in the wild, we explore the rich spatio-temporal cues derived from the 3D data. Specifically, STRL takes two temporally-correlated frames from a 3D point cloud sequence as the input, transforms them with spatial data augmentation, and learns an invariant representation in a self-supervised manner. To corroborate the efficacy of STRL, we conduct extensive experiments on three types of datasets (synthetic, indoor, and outdoor). Experimental results demonstrate that, compared with supervised learning methods, the learned self-supervised representation helps various models attain comparable or even better performance, and that the pre-trained models generalize to downstream tasks, including 3D shape classification, 3D object detection, and 3D semantic segmentation. Moreover, the spatio-temporal contextual cues embedded in 3D point clouds significantly improve the learned representations.
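
The pipeline described in the abstract (two temporally correlated frames, a spatial augmentation applied to each, and an objective that rewards invariant representations) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the particular augmentations (rotation about the z-axis, uniform scaling, Gaussian jitter), the `toy_encoder` stand-in for a real point cloud backbone, and the cosine-based `invariance_loss` are all hypothetical choices.

```python
import math
import random


def augment(points, rng):
    """Apply one random spatial augmentation to a list of (x, y, z) points.

    Assumed augmentations: rotation about z, uniform scaling, small jitter.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    scale = rng.uniform(0.8, 1.25)
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y, z in points:
        xr, yr = c * x - s * y, s * x + c * y
        out.append((
            scale * xr + rng.gauss(0.0, 0.01),
            scale * yr + rng.gauss(0.0, 0.01),
            scale * z + rng.gauss(0.0, 0.01),
        ))
    return out


def toy_encoder(points):
    """Hypothetical stand-in for a backbone: order-invariant pooled statistics."""
    n = len(points)
    means = [sum(p[i] for p in points) / n for i in range(3)]
    maxes = [max(p[i] for p in points) for i in range(3)]
    return means + maxes


def invariance_loss(a, b):
    """Return 2 - 2*cos(a, b): near 0 for aligned embeddings, near 4 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 2.0 - 2.0 * dot / (norm_a * norm_b + 1e-12)


def strl_style_step(frame_t, frame_t_next, rng):
    """Augment two temporally correlated frames and score their agreement."""
    view_a = augment(frame_t, rng)
    view_b = augment(frame_t_next, rng)
    return invariance_loss(toy_encoder(view_a), toy_encoder(view_b))


if __name__ == "__main__":
    rng = random.Random(0)
    frame_t = [(1.0, 0.0, 0.5), (0.0, 2.0, -0.5), (-1.0, -1.0, 1.0)]
    # A slightly shifted copy plays the role of the next frame in a sequence.
    frame_t_next = [(x + 0.05, y, z) for x, y, z in frame_t]
    print(strl_style_step(frame_t, frame_t_next, rng))
```

In a real setup the encoder would be a trainable network and the loss would be minimized over many frame pairs; the sketch only shows how the two views are formed and compared.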

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| Object Detection | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |
| 3D | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| 3D | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |
| Shape Representation Of 3D Point Clouds | ModelNet40 | Overall Accuracy | 93.1 | STRL + DGCNN |
| 3D Object Detection | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| 3D Object Detection | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |
| 3D Point Cloud Classification | ModelNet40 | Overall Accuracy | 93.1 | STRL + DGCNN |
| 2D Classification | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| 2D Classification | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |
| 2D Object Detection | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| 2D Object Detection | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |
| 3D Point Cloud Linear Classification | ModelNet40 | Overall Accuracy | 90.9 | STRL |
| 3D Point Cloud Reconstruction | ModelNet40 | Overall Accuracy | 93.1 | STRL + DGCNN |
| 16k | SUN-RGBD | mAP@0.25 | 59.2 | STRL + VoteNet ShapeNet_Pretrain |
| 16k | SUN-RGBD | mAP@0.25 | 58.2 | STRL + VoteNet |

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection (2025-07-17)
Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models (2025-07-17)