Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

Xinhao Li, Yuhan Zhu, Limin Wang

2023-10-02 · Action Classification · Video Recognition · Action Recognition
Paper · PDF · Code (official)

Abstract

Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Due to the huge number of parameters and effective transferability of image models, performing full fine-tuning is less efficient and even unnecessary. Thus, recent research is shifting its focus toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational costs to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA). This approach efficiently endows the image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that utilizes lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to the customized linear design, all newly added adapters can be merged with the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully-supervised and few-shot video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
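The "zero extra cost during inference" claim rests on the adapters being purely linear: a linear map composed with another linear map is itself linear, so the adapter can be folded into the frozen layer's weights after training. The following is a minimal NumPy sketch of that structural-reparameterization idea under assumed shapes; the adapter form (`y + A @ y`) and all names here are illustrative, not the paper's actual adapter placement or parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Frozen pre-trained linear layer: y = W x + b
W = rng.standard_normal((d, d))
b = rng.standard_normal(d)

# Hypothetical lightweight linear adapter (no nonlinearity): y' = y + A y
A = 0.01 * rng.standard_normal((d, d))

def adapted_forward(x):
    """Training-time path: frozen layer plus adapter (one extra matmul)."""
    y = W @ x + b
    return y + A @ y

# Structural reparameterization: y + A y = (I + A)(W x + b),
# so fold the adapter into merged weights once, after training.
W_merged = (np.eye(d) + A) @ W
b_merged = (np.eye(d) + A) @ b

def merged_forward(x):
    """Inference-time path: a single matmul, identical output, zero extra cost."""
    return W_merged @ x + b_merged

x = rng.standard_normal(d)
assert np.allclose(adapted_forward(x), merged_forward(x))
```

Because the merge is exact (not an approximation), the deployed model has exactly the same architecture, parameter count, and FLOPs as the original image transformer; this is what distinguishes the linear design from adapters with nonlinearities, which cannot be folded away.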

Results

Task                 | Dataset                 | Metric                        | Value | Model
Video Recognition    | Kinetics-400            | Acc@1                         | 87.2  | ZeroI2V ViT-L/14
Video Recognition    | Kinetics-400            | Acc@5                         | 97.6  | ZeroI2V ViT-L/14
Activity Recognition | HMDB-51                 | Average accuracy of 3 splits  | 83.4  | ZeroI2V ViT-L/14
Activity Recognition | Something-Something V2  | Top-1 Accuracy                | 72.2  | ZeroI2V ViT-L/14
Activity Recognition | Something-Something V2  | Top-5 Accuracy                | 93    | ZeroI2V ViT-L/14
Activity Recognition | UCF101                  | 3-fold Accuracy               | 98.6  | ZeroI2V ViT-L/14
Action Recognition   | HMDB-51                 | Average accuracy of 3 splits  | 83.4  | ZeroI2V ViT-L/14
Action Recognition   | Something-Something V2  | Top-1 Accuracy                | 72.2  | ZeroI2V ViT-L/14
Action Recognition   | Something-Something V2  | Top-5 Accuracy                | 93    | ZeroI2V ViT-L/14
Action Recognition   | UCF101                  | 3-fold Accuracy               | 98.6  | ZeroI2V ViT-L/14

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Adapting Vision-Language Models for Evaluating World Models (2025-06-22)