Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang
Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large foundation models incur a high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy, where the teacher model sees more context information through a lower masking ratio, while the student model still uses a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with the ViT-B model, and 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
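To make the asymmetric masking and multi-layer feature alignment described above concrete, here is a minimal PyTorch sketch, not the official AMD implementation: `TinyEncoder`, `amd_style_losses`, the `align_heads` projections, the 0.75/0.90 masking ratios, and the nesting of the student's visible tokens inside the teacher's are illustrative assumptions, and the student's own masked-reconstruction (MAE) loss is omitted for brevity.

```python
# Minimal sketch of asymmetric masked distillation (assumptions noted above;
# this is NOT the official AMD code). The teacher encodes more visible tokens
# (lower masking ratio), the student encodes fewer (higher masking ratio), and
# intermediate student features are aligned to teacher features on the tokens
# both encoders see.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a ViT encoder: a stack of Transformer blocks over patch tokens."""

    def __init__(self, dim: int = 192, depth: int = 4, heads: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, tokens: torch.Tensor) -> list[torch.Tensor]:
        feats, x = [], tokens
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)  # keep per-layer features for multi-layer alignment
        return feats


def amd_style_losses(patch_tokens, teacher, student, align_heads,
                     teacher_mask_ratio=0.75, student_mask_ratio=0.90):
    """Asymmetric masking + multi-layer feature alignment (illustrative ratios)."""
    _, N, _ = patch_tokens.shape
    perm = torch.randperm(N)
    keep_t = perm[: int(N * (1 - teacher_mask_ratio))]    # teacher sees more tokens
    keep_s = keep_t[: int(N * (1 - student_mask_ratio))]  # student tokens nested in the teacher's (sketch assumption)

    with torch.no_grad():  # the teacher stands in for a pre-trained, frozen MAE encoder
        t_feats = teacher(patch_tokens[:, keep_t])
    s_feats = student(patch_tokens[:, keep_s])

    # Align each student layer (after a linear projection head) with the
    # corresponding teacher layer, restricted to the shared visible tokens:
    # since keep_s is a prefix of keep_t, those are the first len(keep_s) positions.
    align_loss = patch_tokens.new_zeros(())
    for s_f, t_f, head in zip(s_feats, t_feats, align_heads):
        align_loss = align_loss + F.mse_loss(head(s_f), t_f[:, : keep_s.numel()])
    # In full AMD-style pre-training this term would be combined with the
    # student's masked-reconstruction loss; that part is omitted here.
    return align_loss / len(align_heads)


if __name__ == "__main__":
    dim, num_patches = 192, 196                      # e.g. a 14x14 patch grid
    teacher, student = TinyEncoder(dim), TinyEncoder(dim)
    align_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(4))
    dummy_tokens = torch.randn(2, num_patches, dim)  # dummy patch embeddings
    loss = amd_style_losses(dummy_tokens, teacher, student, align_heads)
    loss.backward()
    print(f"alignment loss: {loss.item():.4f}")
```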
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Accuracy | 82.2 | AMD(ViT-B/16) |
| Action Recognition | Kinetics-400 | Top-5 Accuracy | 95.3 | AMD(ViT-B/16) |
| Action Recognition | Kinetics-400 | Parameters (M) | 87 | AMD(ViT-B/16) |
| Action Recognition | Kinetics-400 | Top-1 Accuracy | 80.1 | AMD(ViT-S/16) |
| Action Recognition | Kinetics-400 | Top-5 Accuracy | 94.5 | AMD(ViT-S/16) |
| Action Recognition | Kinetics-400 | Parameters (M) | 22 | AMD(ViT-S/16) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 79.6 | AMD(ViT-B/16) |
| Action Recognition | Something-Something V2 | Parameters (M) | 87 | AMD(ViT-B/16) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 73.3 | AMD(ViT-B/16) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94 | AMD(ViT-B/16) |
| Action Recognition | Something-Something V2 | Parameters (M) | 22 | AMD(ViT-S/16) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 70.2 | AMD(ViT-S/16) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 92.5 | AMD(ViT-S/16) |
| Action Recognition | UCF101 | 3-fold Accuracy | 97.1 | AMD(ViT-B/16) |
| Action Recognition | AVA v2.2 | mAP | 33.5 | AMD(ViT-B/16) |