Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Chao-yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

2022-01-20 · CVPR 2022 · Action Classification · Video Recognition · Action Anticipation · Action Recognition

Paper · PDF · Code (official)

Abstract

While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
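The core idea in the abstract — process a video clip by clip and cache keys/values as a "memory" that later clips attend over — can be illustrated with a toy sketch. This is a minimal NumPy illustration of the caching pattern, not MeMViT itself (the class name, shapes, and the simple truncation of old memory are assumptions for illustration; the actual model uses multiscale attention and a learned memory compression):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MemoryAttention:
    """Toy single-head attention that caches past clips' keys/values
    ("memory") so the current clip can attend over prior context.
    Hypothetical illustration of the caching idea, not the MeMViT code."""

    def __init__(self, dim, mem_len):
        self.dim = dim
        self.mem_len = mem_len            # memory budget in tokens
        self.mem_k = np.zeros((0, dim))   # cached keys from earlier clips
        self.mem_v = np.zeros((0, dim))   # cached values from earlier clips

    def __call__(self, q, k, v):
        # Attend jointly over the cached memory and the current clip.
        k_all = np.concatenate([self.mem_k, k], axis=0)
        v_all = np.concatenate([self.mem_v, v], axis=0)
        attn = softmax(q @ k_all.T / np.sqrt(self.dim))
        out = attn @ v_all
        # Cache this clip's keys/values for the next iteration, keeping
        # only the newest `mem_len` tokens (MeMViT instead compresses them,
        # which is why its extra compute stays marginal).
        self.mem_k = np.concatenate([self.mem_k, k], axis=0)[-self.mem_len:]
        self.mem_v = np.concatenate([self.mem_v, v], axis=0)[-self.mem_len:]
        return out
```

Because each clip only ever attends over a bounded memory plus its own tokens, the per-clip cost stays roughly constant while the effective temporal support grows with every iteration.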

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | EPIC-KITCHENS-100 | Action@1 | 48.4 | MeMViT-24 |
| Activity Recognition | EPIC-KITCHENS-100 | Noun@1 | 60.3 | MeMViT-24 |
| Activity Recognition | EPIC-KITCHENS-100 | Verb@1 | 71.4 | MeMViT-24 |
| Activity Recognition | AVA v2.2 | mAP | 35.4 | MeMViT-24 |
| Activity Recognition | EPIC-KITCHENS-100 | Recall@5 | 17.7 | MeMViT-24 |
| Action Recognition | EPIC-KITCHENS-100 | Action@1 | 48.4 | MeMViT-24 |
| Action Recognition | EPIC-KITCHENS-100 | Noun@1 | 60.3 | MeMViT-24 |
| Action Recognition | EPIC-KITCHENS-100 | Verb@1 | 71.4 | MeMViT-24 |
| Action Recognition | AVA v2.2 | mAP | 35.4 | MeMViT-24 |
| Action Recognition | EPIC-KITCHENS-100 | Recall@5 | 17.7 | MeMViT-24 |
| Action Anticipation | EPIC-KITCHENS-100 | Recall@5 | 17.7 | MeMViT-24 |
| 2D Human Pose Estimation | EPIC-KITCHENS-100 | Recall@5 | 17.7 | MeMViT-24 |
| Action Recognition In Videos | EPIC-KITCHENS-100 | Recall@5 | 17.7 | MeMViT-24 |

Related Papers

- A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
- DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
- Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
- Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
- CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
- Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
- Adapting Vision-Language Models for Evaluating World Models (2025-06-22)