Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Revealing the Dark Secrets of Masked Image Modeling

Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, Yue Cao

Published: 2022-05-26 · CVPR 2023
Tasks: Visual Object Tracking · Pose Estimation · Object Tracking · Depth Estimation · Video Object Tracking · Monocular Depth Estimation
Links: Paper · PDF · Code (official)

Abstract

Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remains unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains a large diversity among attention heads in all layers; for supervised models, the diversity among attention heads almost disappears in the last three layers, and this loss of diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics, as well as on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L can achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope our work can inspire new and solid research in this direction.
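The locality finding in the abstract is commonly quantified with the average attention distance per head: the mean spatial distance between each query token and the tokens it attends to, weighted by the attention probabilities. The sketch below is illustrative only (it is not the paper's code) and assumes a square token grid with an attention tensor of shape (heads, N, N):

```python
import numpy as np

def avg_attention_distance(attn, grid_size):
    """Per-head average attention distance on a square token grid.

    attn: array of shape (heads, N, N) with each row summing to 1,
          where N = grid_size * grid_size (CLS token excluded).
    Returns one mean distance per head, in units of grid cells.
    """
    heads, n, _ = attn.shape
    assert n == grid_size * grid_size
    # (row, col) position of every token on the grid.
    ys, xs = np.divmod(np.arange(n), grid_size)
    # Pairwise Euclidean distances between token positions.
    dist = np.sqrt((ys[:, None] - ys[None, :]) ** 2 +
                   (xs[:, None] - xs[None, :]) ** 2)
    # Attention-weighted distance per query, averaged over queries.
    return (attn * dist[None]).sum(axis=-1).mean(axis=-1)
```

Heads with small average distance behave locally (the bias MIM preserves in all layers), while large distances indicate global attention.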

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.287 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | RMSE | 0.304 | SwinV2-B 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25 | 0.949 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^2 | 0.994 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Delta < 1.25^3 | 0.999 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | Absolute relative error | 0.083 | SwinV2-L 1K-MIM |
| Depth Estimation | NYU-Depth V2 | log10 | 0.035 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.977 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 1.000 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE | 1.966 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.075 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.139 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Absolute relative error | 0.050 | SwinV2-L 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25 | 0.976 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^2 | 0.998 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Delta < 1.25^3 | 0.999 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE | 2.050 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | RMSE log | 0.078 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Sq Rel | 0.148 | SwinV2-B 1K-MIM |
| Depth Estimation | KITTI Eigen split | Absolute relative error | 0.052 | SwinV2-B 1K-MIM |
| Pose Estimation | COCO test-dev | AP | 77.2 | SwinV2-L 1K-MIM |
| Pose Estimation | COCO test-dev | AP | 76.7 | SwinV2-B 1K-MIM |
| Pose Estimation | CrowdPose | AP | 75.5 | SwinV2-L 1K-MIM |
| Pose Estimation | CrowdPose | AP | 74.9 | SwinV2-B 1K-MIM |
| Object Tracking | LaSOT | AUC | 70.7 | SwinV2-L 1K-MIM |
| Object Tracking | LaSOT | AUC | 70.0 | SwinV2-B 1K-MIM |
| Object Tracking | GOT-10k | Average Overlap | 72.9 | SwinV2-L 1K-MIM |
| Object Tracking | GOT-10k | Average Overlap | 70.8 | SwinV2-B 1K-MIM |
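For reference, the depth metrics reported above (absolute relative error, Sq Rel, RMSE, RMSE log, log10, and the Delta < 1.25^k accuracies) follow the standard evaluation protocol for monocular depth estimation. The NumPy sketch below gives the standard definitions; it is not code from the paper:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid-pixel arrays.

    pred, gt: positive depth values of the same shape (e.g. metres).
    Returns abs_rel, sq_rel, rmse, rmse_log, log10, and the
    Delta < 1.25^k accuracies for k = 1, 2, 3.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Accuracy under threshold: fraction of pixels whose ratio to
    # ground truth is within 1.25^k.
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = float((thresh < 1.25).mean())
    d2 = float((thresh < 1.25 ** 2).mean())
    d3 = float((thresh < 1.25 ** 3).mean())

    abs_rel = float((np.abs(gt - pred) / gt).mean())
    sq_rel = float(((gt - pred) ** 2 / gt).mean())
    rmse = float(np.sqrt(((gt - pred) ** 2).mean()))
    rmse_log = float(np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean()))
    log10 = float(np.abs(np.log10(gt) - np.log10(pred)).mean())

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "log10": log10,
            "delta1": d1, "delta2": d2, "delta3": d3}
```

Lower is better for the error metrics; higher (up to 1.0) is better for the Delta accuracies, which is why the table's Delta values cluster near 1 while RMSE values are small.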

Related Papers

- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)
- From Neck to Head: Bio-Impedance Sensing for Head Pose Estimation (2025-07-17)
- AthleticsPose: Authentic Sports Motion Dataset on Athletic Field and Evaluation of Monocular 3D Pose Estimation Ability (2025-07-17)
- MVA 2025 Small Multi-Object Tracking for Spotting Birds Challenge: Dataset, Methods, and Results (2025-07-17)
- $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation (2025-07-17)
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025-07-16)