Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?

Runpei Dong, Zekun Qi, Linfeng Zhang, Junbo Zhang, Jianjian Sun, Zheng Ge, Li Yi, Kaisheng Ma

2022-12-16
Tasks: Representation Learning, Few-Shot 3D Point Cloud Classification, Knowledge Distillation, 3D Point Cloud Classification

Abstract

The success of deep learning relies heavily on large-scale data with comprehensive labels, which are more expensive and time-consuming to collect in 3D than for 2D images or natural language. This motivates using models pretrained on data from modalities other than 3D as teachers for cross-modal knowledge transfer. In this paper, we revisit masked modeling from a unified knowledge-distillation perspective, and we show that foundational Transformers pretrained on 2D images or natural language can help self-supervised 3D representation learning by training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen and prompt-tuned for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the targets of masked point modeling, through which the dark knowledge is distilled into the 3D Transformer students as foundational geometry understanding. Our ACT-pretrained 3D learner achieves state-of-the-art generalization across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Code has been released at https://github.com/RunpeiDong/ACT.
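The distillation step described in the abstract can be read as masked-feature regression: a random subset of point patches is hidden from the student, and the student is trained to match the frozen teacher's latent features only at the masked positions. The sketch below is a minimal pure-Python illustration under stated assumptions. Patch features are plain lists of floats, the mask ratio is a free parameter, and mean-squared error stands in for whatever loss ACT actually uses. The names `random_patch_mask` and `masked_distillation_loss` are hypothetical and do not come from the ACT codebase.

```python
import random
from typing import List, Sequence

Vector = List[float]


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> List[bool]:
    """Return a boolean mask where True marks a patch hidden from the student.

    Hypothetical helper: hides round(num_patches * mask_ratio) patches chosen
    uniformly at random (the ratio ACT uses is not assumed here).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


def masked_distillation_loss(
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    mask: Sequence[bool],
) -> float:
    """Average per-patch mean-squared error between student predictions and
    frozen-teacher latents, computed over masked patches only.

    MSE is an assumed stand-in for the paper's actual distillation loss.
    """
    if not (len(student_feats) == len(teacher_feats) == len(mask)):
        raise ValueError("student, teacher, and mask must have the same length")
    total, count = 0.0, 0
    for s, t, is_masked in zip(student_feats, teacher_feats, mask):
        if not is_masked:
            continue  # visible patches contribute no loss
        # Teacher features act as fixed targets; in a real framework they
        # would be detached so no gradient flows into the frozen teacher.
        total += sum((si - ti) ** 2 for si, ti in zip(s, t)) / len(t)
        count += 1
    return total / count if count else 0.0
```

In an actual training loop, the teacher would be the frozen, prompt-tuned pretrained Transformer wrapped as a 3D autoencoder. The loss would update only the student.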

Results

Each result below is listed identically under three tasks: Shape Representation Of 3D Point Clouds, 3D Point Cloud Classification, and 3D Point Cloud Reconstruction.

ScanObjectNN

| Dataset | Metric | Value (%) | Model |
|---|---|---|---|
| ScanObjectNN | Overall Accuracy | 89.17 | ACT |
| ScanObjectNN | OBJ-BG (OA) | 93.29 | ACT (no voting) |
| ScanObjectNN | OBJ-ONLY (OA) | 91.91 | ACT (no voting) |
| ScanObjectNN | Overall Accuracy | 88.21 | ACT (no voting) |

ModelNet40 few-shot

| Dataset | Overall Accuracy (%) | Standard Deviation | Model |
|---|---|---|---|
| ModelNet40 5-way (10-shot) | 96.8 | 2.3 | ACT |
| ModelNet40 5-way (20-shot) | 98.0 | 1.4 | ACT |
| ModelNet40 10-way (10-shot) | 93.3 | 4.0 | ACT |
| ModelNet40 10-way (20-shot) | 95.6 | 2.8 | ACT |
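The few-shot rows report a mean overall accuracy together with a standard deviation. This suggests the results are aggregated over several independently sampled N-way K-shot episodes. The sketch below shows that bookkeeping under stated assumptions. Each episode draws N classes and then K support plus some query examples per class. The spread is computed as the population standard deviation (`statistics.pstdev`). The number of runs and the exact estimator behind these numbers are not given on this page. `sample_episode` and `summarize_runs` are hypothetical names.

```python
import random
import statistics
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_episode(
    labels: Sequence[int], n_way: int, k_shot: int, n_query: int, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Pick n_way classes, then k_shot support and n_query query indices per class.

    Returns (support_indices, query_indices) into `labels`; support and query
    never share an example.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough examples for both support and query are eligible.
    eligible = sorted(c for c, ids in by_class.items() if len(ids) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")
    support: List[int] = []
    query: List[int] = []
    for cls in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[cls], k_shot + n_query)
        support.extend(picked[:k_shot])
        query.extend(picked[k_shot:])
    return support, query


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-run accuracies (in %).

    pstdev is an assumed choice; a benchmark may instead use the sample stdev.
    """
    if not accuracies:
        raise ValueError("need at least one run")
    return statistics.fmean(accuracies), statistics.pstdev(accuracies)
```

Each ModelNet40 setting would call `summarize_runs` once on that setting's per-run accuracies. This page reports only the resulting summaries, not the per-run values.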

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
- Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
- Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
- Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
- Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
- Are encoders able to learn landmarkers for warm-starting of Hyperparameter Optimization? (2025-07-16)
- Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)