Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Revealing the Dark Secrets of Masked Image Modeling

Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao

Published: 2022-05-26 · CVPR 2023
Tasks: Visual Object Tracking · Pose Estimation · Object Tracking · Depth Estimation · Video Object Tracking · Monocular Depth Estimation
Links: Paper · PDF · Code (official)

Abstract

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains a large diversity among attention heads in all layers; for supervised models, the diversity among attention heads almost disappears in the last three layers, and this loss of diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics, as well as on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L can achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope our work can inspire new and solid research in this direction.
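The locality finding in the abstract is commonly quantified with the average attention distance per head: the mean spatial distance between each query token and the tokens it attends to, weighted by the attention probabilities. The sketch below is illustrative only (it is not the paper's code) and assumes a square token grid with an attention tensor of shape (heads, N, N):

```python
import numpy as np

def avg_attention_distance(attn, grid_size):
    """Per-head average attention distance on a square token grid.

    attn: array of shape (heads, N, N) with each row summing to 1,
          where N = grid_size * grid_size (CLS token excluded).
    Returns one mean distance per head, in units of grid cells.
    """
    heads, n, _ = attn.shape
    assert n == grid_size * grid_size
    # (row, col) position of every token on the grid.
    ys, xs = np.divmod(np.arange(n), grid_size)
    # Pairwise Euclidean distances between token positions.
    dist = np.sqrt((ys[:, None] - ys[None, :]) ** 2 +
                   (xs[:, None] - xs[None, :]) ** 2)
    # Attention-weighted distance per query, averaged over queries.
    return (attn * dist[None]).sum(axis=-1).mean(axis=-1)
```

Heads with small average distance behave locally (the bias MIM preserves in all layers), while large distances indicate global attention.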

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.287 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.304 | SwinV2-B 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.949 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.994 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Absolute relative error | 0.083 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | log10 | 0.035 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.977 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1.000 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE | 1.966 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.075 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.139 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Absolute relative error | 0.050 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.976 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.999 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE | 2.050 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.078 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.148 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Absolute relative error | 0.052 | SwinV2-B 1K-MIM |
| Pose Estimation | COCO test-dev | AP | 77.2 | SwinV2-L 1K-MIM |
| Pose Estimation | COCO test-dev | AP | 76.7 | SwinV2-B 1K-MIM |
| Pose Estimation | CrowdPose | AP | 75.5 | SwinV2-L 1K-MIM |
| Pose Estimation | CrowdPose | AP | 74.9 | SwinV2-B 1K-MIM |
| Object Tracking | LaSOT | AUC | 70.7 | SwinV2-L 1K-MIM |
| Object Tracking | LaSOT | AUC | 70.0 | SwinV2-B 1K-MIM |
| Object Tracking | GOT-10k | Average Overlap | 72.9 | SwinV2-L 1K-MIM |
| Object Tracking | GOT-10k | Average Overlap | 70.8 | SwinV2-B 1K-MIM |
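For reference, the depth metrics reported above (absolute relative error, Sq Rel, RMSE, RMSE log, log10, and the Delta < 1.25^k accuracies) follow the standard evaluation protocol for monocular depth estimation. The NumPy sketch below gives the standard definitions; it is not code from the paper:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid-pixel arrays.

    pred, gt: positive depth values of the same shape (e.g. metres).
    Returns abs_rel, sq_rel, rmse, rmse_log, log10, and the
    Delta < 1.25^k accuracies for k = 1, 2, 3.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Accuracy under threshold: fraction of pixels whose ratio to
    # ground truth is within 1.25^k.
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = float((thresh < 1.25).mean())
    d2 = float((thresh < 1.25 ** 2).mean())
    d3 = float((thresh < 1.25 ** 3).mean())

    abs_rel = float((np.abs(gt - pred) / gt).mean())
    sq_rel = float(((gt - pred) ** 2 / gt).mean())
    rmse = float(np.sqrt(((gt - pred) ** 2).mean()))
    rmse_log = float(np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean()))
    log10 = float(np.abs(np.log10(gt) - np.log10(pred)).mean())

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "log10": log10,
            "delta1": d1, "delta2": d2, "delta3": d3}
```

Lower is better for the error metrics; higher (up to 1.0) is better for the Delta accuracies, which is why the table's Delta values cluster near 1 while RMSE values are small.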

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)