
Dual-path Adaptation from Image to Video Transformers

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

Date: 2023-03-17
Venue: CVPR 2023
Tasks: Activity Recognition In Videos, Action Classification, Video Understanding, Action Recognition, Action Recognition In Videos, Activity Recognition

Abstract

In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have handled spatial and temporal modeling simultaneously with a unified learnable module, but they still struggled to fully leverage the representational capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. For temporal dynamic modeling in particular, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' ability to extrapolate relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks show that pretrained image transformers with DualPath can be effectively generalized beyond the data domain.
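The two ingredients named in the abstract, a grid-like frameset built from consecutive frames and a lightweight bottleneck adapter, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `make_grid_frameset` and `bottleneck_adapter`, the row-major tiling order, the GELU activation, and the residual form are all assumptions made for illustration, and any resizing of frames before tiling is left out.

```python
import numpy as np


def make_grid_frameset(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile rows*cols consecutive frames into a single grid image.

    frames: array of shape (T, H, W, C) with T == rows * cols.
    Returns an array of shape (rows * H, cols * W, C).
    The row-major frame order is an assumption for this sketch.
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    grid = frames.reshape(rows, cols, h, w, c)
    # Reorder to (rows, H, cols, W, C) so each frame stays a contiguous tile.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def bottleneck_adapter(x: np.ndarray, w_down: np.ndarray, w_up: np.ndarray) -> np.ndarray:
    """Hypothetical residual bottleneck adapter: x + up(gelu(down(x))).

    x: tokens of shape (N, D); w_down: (D, r); w_up: (r, D), with r much smaller than D.
    Only w_down and w_up would be trained; the backbone stays frozen.
    """
    return x + gelu(x @ w_down) @ w_up


if __name__ == "__main__":
    # Four 2x2 single-channel frames tiled into one 4x4 grid image.
    frames = np.arange(16, dtype=np.float32).reshape(4, 2, 2, 1)
    grid = make_grid_frameset(frames, rows=2, cols=2)
    print(grid.shape)

    tokens = np.random.default_rng(0).normal(size=(5, 8))
    out = bottleneck_adapter(tokens, np.zeros((8, 2)), np.zeros((2, 8)))
    print(out.shape)
```

With a zero-initialized up-projection the adapter reduces to the identity, which is a common way to start adapter training from the behavior of the frozen backbone; whether DualPath initializes its adapters this way is not stated in the abstract.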

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | HMDB51 | Acc@1 | 75.6 | DualPath w/ ViT-B/16 MLPs. |
| Video | Diving-48 | Acc@1 | 88.7 | DualPath w/ ViT-B/16 |
| Video | Kinetics-400 | Acc@1 | 87.7 | DualPath w/ ViT-L/14 |
| Video | Kinetics-400 | Acc@5 | 97.8 | DualPath w/ ViT-L/14 |
| Video | Kinetics-400 | Acc@1 | 85.4 | DualPath w/ ViT-B/16 |
| Video | Kinetics-400 | Acc@5 | 97.1 | DualPath w/ ViT-B/16 |
| Activity Recognition | Diving-48 | Accuracy | 88.7 | DUALPATH |
| Action Recognition | Diving-48 | Accuracy | 88.7 | DUALPATH |
