
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Renrui Zhang, Ziyu Guo, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li, Peng Gao

2022-05-28

Tasks: 3D Point Cloud Linear Classification, Representation Learning, Self-Supervised Learning, Open-Ended Question Answering, Few-Shot 3D Point Cloud Classification, object-detection, 3D Point Cloud Classification, 3D Object Detection, Object Detection

Abstract

Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers. However, how to exploit masked autoencoding for learning 3D representations of irregular point clouds remains an open question. In this paper, we propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds. Unlike the standard transformer in MAE, we modify the encoder and decoder into pyramid architectures to progressively model spatial geometries and capture both fine-grained and high-level semantics of 3D shapes. For the encoder, which downsamples point tokens in stages, we design a multi-scale masking strategy to generate consistent visible regions across scales, and adopt a local spatial self-attention mechanism during fine-tuning to focus on neighboring patterns. Through multi-scale token propagation, the lightweight decoder gradually upsamples point tokens with complementary skip connections from the encoder, which further promotes reconstruction from a global-to-local perspective. Extensive experiments demonstrate the state-of-the-art performance of Point-M2AE for 3D representation learning. With a frozen encoder after pre-training, Point-M2AE achieves 92.9% accuracy with a linear SVM on ModelNet40, even surpassing some fully trained methods. After fine-tuning on downstream tasks, Point-M2AE achieves 86.43% accuracy on ScanObjectNN, 3.36% higher than the second-best method, and the hierarchical pre-training scheme substantially benefits few-shot classification, part segmentation, and 3D object detection. Code is available at https://github.com/ZrrSkywalker/Point-M2AE.
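
The abstract mentions a multi-scale masking strategy that keeps visible regions consistent across scales. One way to picture that idea is sketched below. It is only an illustration under stated assumptions, not the authors' implementation. It assumes the visible tokens are sampled at the coarsest scale and then propagated down to finer scales through a parent-to-children grouping. The function name `multiscale_visible_masks`, the `children_per_scale` layout, and the toy hierarchy are all hypothetical.

```python
import random


def multiscale_visible_masks(children_per_scale, mask_ratio, seed=0):
    """Return one set of visible token indices per scale, finest first.

    Assumption (hypothetical layout): children_per_scale[s][j] lists the
    indices of scale-s tokens grouped under token j of the next coarser
    scale s + 1. Scale 0 is the finest.
    """
    rng = random.Random(seed)

    # Sample the visible tokens once, at the coarsest scale.
    num_coarsest = len(children_per_scale[-1])
    num_visible = max(1, round(num_coarsest * (1 - mask_ratio)))
    visible = set(rng.sample(range(num_coarsest), num_visible))
    masks = [visible]

    # Propagate coarse-to-fine: a finer token is visible only if it is
    # grouped under a visible coarser token, so regions stay consistent.
    for children in reversed(children_per_scale):
        visible = {c for parent in visible for c in children[parent]}
        masks.append(visible)

    return masks[::-1]


# Toy hierarchy: 8 fine tokens -> 4 mid tokens -> 2 coarse tokens.
toy_children = [
    [[0, 1], [2, 3], [4, 5], [6, 7]],  # mid token -> fine tokens
    [[0, 1], [2, 3]],                  # coarse token -> mid tokens
]
toy_masks = multiscale_visible_masks(toy_children, mask_ratio=0.5)
print([len(m) for m in toy_masks])  # [4, 2, 1]
```

Sampling once at the coarsest scale and deriving the finer masks from it means no scale can see a region that another scale has masked. Real point-cloud groupings typically overlap, which the set-based propagation tolerates.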

Results

Task                                     Dataset                      Metric              Value  Model
Shape Representation Of 3D Point Clouds  ScanObjectNN                 OBJ-BG (OA)         91.22  Point-M2AE
Shape Representation Of 3D Point Clouds  ScanObjectNN                 OBJ-ONLY (OA)       88.81  Point-M2AE
Shape Representation Of 3D Point Clouds  ScanObjectNN                 Overall Accuracy    86.43  Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40                   Overall Accuracy    94     Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40                   Overall Accuracy    92.9   Point-M2AE-SVM
Shape Representation Of 3D Point Clouds  ModelNet40 10-way (20-shot)  Overall Accuracy    95     Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 10-way (20-shot)  Standard Deviation  3      Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 5-way (10-shot)   Overall Accuracy    96.8   Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 5-way (10-shot)   Standard Deviation  1.8    Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 10-way (10-shot)  Overall Accuracy    92.3   Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 10-way (10-shot)  Standard Deviation  4.5    Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 5-way (20-shot)   Overall Accuracy    98.3   Point-M2AE
Shape Representation Of 3D Point Clouds  ModelNet40 5-way (20-shot)   Standard Deviation  1.4    Point-M2AE
3D Point Cloud Classification            ScanObjectNN                 OBJ-BG (OA)         91.22  Point-M2AE
3D Point Cloud Classification            ScanObjectNN                 OBJ-ONLY (OA)       88.81  Point-M2AE
3D Point Cloud Classification            ScanObjectNN                 Overall Accuracy    86.43  Point-M2AE
3D Point Cloud Classification            ModelNet40                   Overall Accuracy    94     Point-M2AE
3D Point Cloud Classification            ModelNet40                   Overall Accuracy    92.9   Point-M2AE-SVM
3D Point Cloud Classification            ModelNet40 10-way (20-shot)  Overall Accuracy    95     Point-M2AE
3D Point Cloud Classification            ModelNet40 10-way (20-shot)  Standard Deviation  3      Point-M2AE
3D Point Cloud Classification            ModelNet40 5-way (10-shot)   Overall Accuracy    96.8   Point-M2AE
3D Point Cloud Classification            ModelNet40 5-way (10-shot)   Standard Deviation  1.8    Point-M2AE
3D Point Cloud Classification            ModelNet40 10-way (10-shot)  Overall Accuracy    92.3   Point-M2AE
3D Point Cloud Classification            ModelNet40 10-way (10-shot)  Standard Deviation  4.5    Point-M2AE
3D Point Cloud Classification            ModelNet40 5-way (20-shot)   Overall Accuracy    98.3   Point-M2AE
3D Point Cloud Classification            ModelNet40 5-way (20-shot)   Standard Deviation  1.4    Point-M2AE
3D Point Cloud Linear Classification     ModelNet40                   Overall Accuracy    92.9   Point-M2AE
3D Point Cloud Reconstruction            ScanObjectNN                 OBJ-BG (OA)         91.22  Point-M2AE
3D Point Cloud Reconstruction            ScanObjectNN                 OBJ-ONLY (OA)       88.81  Point-M2AE
3D Point Cloud Reconstruction            ScanObjectNN                 Overall Accuracy    86.43  Point-M2AE
3D Point Cloud Reconstruction            ModelNet40                   Overall Accuracy    94     Point-M2AE
3D Point Cloud Reconstruction            ModelNet40                   Overall Accuracy    92.9   Point-M2AE-SVM
3D Point Cloud Reconstruction            ModelNet40 10-way (20-shot)  Overall Accuracy    95     Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 10-way (20-shot)  Standard Deviation  3      Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 5-way (10-shot)   Overall Accuracy    96.8   Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 5-way (10-shot)   Standard Deviation  1.8    Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 10-way (10-shot)  Overall Accuracy    92.3   Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 10-way (10-shot)  Standard Deviation  4.5    Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 5-way (20-shot)   Overall Accuracy    98.3   Point-M2AE
3D Point Cloud Reconstruction            ModelNet40 5-way (20-shot)   Standard Deviation  1.4    Point-M2AE

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)