Papers With Code 2

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang

Published 2023-11-06 · CVPR 2024
Tasks: Image Classification · Action Classification · Model Compression · Action Recognition · Knowledge Distillation
Links: Paper · PDF

Abstract

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models, but large models incur high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context through a lower masking ratio, while the student model keeps a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with the ViT-B model, and 73.3% classification accuracy with ViT-B on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
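As a rough illustration of the asymmetric masking idea described above (not the authors' implementation), the sketch below samples a larger visible-token set for the teacher and a smaller one for the student, then computes a feature-alignment loss between matching visible positions. The function names, the subset-sampling detail, and the single-layer MSE alignment are simplifying assumptions for illustration; the paper aligns features across multiple layers.

```python
import numpy as np

def asymmetric_masks(num_tokens, teacher_ratio=0.75, student_ratio=0.90, seed=0):
    """Sample visible-token indices for teacher and student.

    The teacher keeps more tokens (lower masking ratio); here the student's
    visible set is drawn from the teacher's visible tokens, so every student
    token has a teacher feature to align against (an assumption made for
    this sketch).
    """
    rng = np.random.default_rng(seed)
    n_teacher = int(num_tokens * (1 - teacher_ratio))
    n_student = int(num_tokens * (1 - student_ratio))
    teacher_visible = rng.permutation(num_tokens)[:n_teacher]
    student_visible = rng.permutation(teacher_visible)[:n_student]
    return np.sort(teacher_visible), np.sort(student_visible)

def alignment_loss(student_feats, teacher_feats, student_idx, teacher_idx):
    """Mean-squared error between student features and the teacher features
    at the same token positions — a single-layer stand-in for AMD's
    multi-layer feature alignment."""
    pos = {int(tok): i for i, tok in enumerate(teacher_idx)}
    matched = teacher_feats[[pos[int(tok)] for tok in student_idx]]
    return float(np.mean((student_feats - matched) ** 2))
```

For a 14x14 patch grid (196 tokens) with the ratios above, the teacher would encode 49 visible tokens while the student encodes only 19, keeping the student's pre-training cheap while distilling from a teacher that saw richer context.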

Results

Task               | Dataset                | Metric                      | Value | Model
Video              | Kinetics-400           | Acc@1                       | 82.2  | AMD (ViT-B/16)
Video              | Kinetics-400           | Acc@5                       | 95.3  | AMD (ViT-B/16)
Video              | Kinetics-400           | Parameters (M)              | 87    | AMD (ViT-B/16)
Video              | Kinetics-400           | Acc@1                       | 80.1  | AMD (ViT-S/16)
Video              | Kinetics-400           | Acc@5                       | 94.5  | AMD (ViT-S/16)
Video              | Kinetics-400           | Parameters (M)              | 22    | AMD (ViT-S/16)
Action Recognition | HMDB-51                | Average accuracy (3 splits) | 79.6  | AMD (ViT-B/16)
Action Recognition | Something-Something V2 | Parameters (M)              | 87    | AMD (ViT-B/16)
Action Recognition | Something-Something V2 | Top-1 Accuracy              | 73.3  | AMD (ViT-B/16)
Action Recognition | Something-Something V2 | Top-5 Accuracy              | 94.0  | AMD (ViT-B/16)
Action Recognition | Something-Something V2 | Parameters (M)              | 22    | AMD (ViT-S/16)
Action Recognition | Something-Something V2 | Top-1 Accuracy              | 70.2  | AMD (ViT-S/16)
Action Recognition | Something-Something V2 | Top-5 Accuracy              | 92.5  | AMD (ViT-S/16)
Action Recognition | UCF101                 | 3-fold Accuracy             | 97.1  | AMD (ViT-B/16)
Action Recognition | AVA v2.2               | mAP                         | 33.5  | AMD (ViT-B/16)

Related Papers

LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression (2025-07-21)
Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)