Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MAR: Masked Autoencoders for Efficient Action Recognition

Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang

2022-07-24 · Action Classification · Video Recognition · Action Recognition
Paper · PDF · Code (official)

Abstract

Standard approaches for video recognition usually operate on the full input videos, which is inefficient due to the widely present spatio-temporal redundancy in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual contents. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating only on a part of the videos. MAR contains two indispensable components: cell running masking and a bridging classifier. Specifically, to enable the ViT to perceive the details beyond the visible patches easily, cell running masking is presented to preserve the spatio-temporal correlations in videos, which ensures that the patches at the same spatial location can be observed in turn for easy reconstruction. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT features encoded for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin. In particular, we find that a ViT-Large trained with MAR outperforms a ViT-Huge trained with a standard training scheme by convincing margins on both the Kinetics-400 and Something-Something v2 datasets, while the computational overhead of ViT-Large is only 14.5% of that of ViT-Huge.
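
The cell running masking described in the abstract can be pictured with a small sketch. The NumPy snippet below is a minimal illustration, not the authors' implementation: the patch grid is tiled into small spatial cells, and the visible positions inside each cell are cyclically shifted by one step per frame, so every spatial location becomes visible in turn across time. The 2x2 cell size, the keep count derived from the mask ratio, and the per-cell random starting order are illustrative assumptions.

```python
import numpy as np

def cell_running_mask(num_frames, height, width, cell=2, mask_ratio=0.75, seed=0):
    """Sketch of a "cell running" mask over a T x H x W grid of ViT patch tokens.

    The patch grid is tiled into cell x cell spatial cells. Within each cell,
    the visible positions "run" (shift by one step) from frame to frame, so
    every spatial location is observed in turn over time. Returns a boolean
    array of shape (num_frames, height, width) where True marks a visible patch.
    """
    assert height % cell == 0 and width % cell == 0
    # Number of patches kept visible per cell, e.g. 1 out of 4 for a 75% mask.
    keep_per_cell = max(1, round(cell * cell * (1.0 - mask_ratio)))
    rng = np.random.default_rng(seed)

    visible = np.zeros((num_frames, height, width), dtype=bool)
    for cy in range(0, height, cell):
        for cx in range(0, width, cell):
            # Random starting order of the positions inside this cell (assumed).
            order = rng.permutation(cell * cell)
            for t in range(num_frames):
                # Shift the visible positions by one step per frame ("running").
                idx = order[(np.arange(keep_per_cell) + t) % (cell * cell)]
                ys, xs = np.unravel_index(idx, (cell, cell))
                visible[t, cy + ys, cx + xs] = True
    return visible

mask = cell_running_mask(num_frames=8, height=14, width=14, mask_ratio=0.75)
print(mask.mean())  # fraction of visible patches, ~0.25 for a 75% mask
```

With a 75% mask, only about a quarter of the patch tokens in each frame are fed to the encoder, which is where the reported reduction in ViT computation comes from; the exact running pattern and cell geometry used in the paper may differ from this sketch.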

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video | Kinetics-400 | Acc@1 | 85.3 | MAR (50% mask, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@5 | 96.3 | MAR (50% mask, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@1 | 83.9 | MAR (75% mask, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@5 | 96 | MAR (75% mask, ViT-L, 16x4) |
| Video | Kinetics-400 | Acc@1 | 81 | MAR (50% mask, ViT-B, 16x4) |
| Video | Kinetics-400 | Acc@5 | 94.4 | MAR (50% mask, ViT-B, 16x4) |
| Video | Kinetics-400 | Acc@1 | 79.4 | MAR (75% mask, ViT-B, 16x4) |
| Video | Kinetics-400 | Acc@5 | 93.7 | MAR (75% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Parameters | 311 | MAR (50% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 74.7 | MAR (50% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 94.9 | MAR (50% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Parameters | 311 | MAR (75% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 73.8 | MAR (75% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 94.4 | MAR (75% mask, ViT-L, 16x4) |
| Activity Recognition | Something-Something V2 | Parameters | 94 | MAR (50% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 71 | MAR (50% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 92.8 | MAR (50% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Parameters | 94 | MAR (75% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 69.5 | MAR (75% mask, ViT-B, 16x4) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.9 | MAR (75% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Parameters | 311 | MAR (50% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 74.7 | MAR (50% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94.9 | MAR (50% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Parameters | 311 | MAR (75% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 73.8 | MAR (75% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94.4 | MAR (75% mask, ViT-L, 16x4) |
| Action Recognition | Something-Something V2 | Parameters | 94 | MAR (50% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 71 | MAR (50% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 92.8 | MAR (50% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Parameters | 94 | MAR (75% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 69.5 | MAR (75% mask, ViT-B, 16x4) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 91.9 | MAR (75% mask, ViT-B, 16x4) |

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)