Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

Huanjin Yao, Wenhao Wu, Zhiheng Li

Date: 2023-11-27
Tasks: Video Retrieval, Action Classification, Transfer Learning, Video Understanding, Action Recognition

Abstract

Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive computationally. Recent studies have turned their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning of large image models for video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) to video understanding tasks, which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), notably on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
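As a quick sanity check on the scale figures quoted in the abstract, the short sketch below recomputes the ViT-E to ViT-L size ratio and the memory fraction left after a 75% reduction. The constant names are illustrative only, and the inputs are the rounded values from the abstract, not measurements.

```python
# Parameter counts as quoted in the abstract (rounded values).
VIT_E_PARAMS = 4.4e9   # ViT-E, 4.4B parameters
VIT_L_PARAMS = 304e6   # ViT-L, 304M parameters

# Ratio of model sizes; the abstract rounds this to "14x".
size_ratio = VIT_E_PARAMS / VIT_L_PARAMS

# A 75% reduction in memory usage leaves one quarter of the baseline.
MEMORY_REDUCTION = 0.75
remaining_fraction = 1.0 - MEMORY_REDUCTION

print(f"ViT-E / ViT-L parameter ratio: {size_ratio:.1f}x")
print(f"Memory remaining relative to adapter baseline: {remaining_fraction:.0%}")
```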

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MSR-VTT-1kA | text-to-video Mean Rank | 12.8 | Side4Video |
| Video | MSR-VTT-1kA | text-to-video Median Rank | 1 | Side4Video |
| Video | MSR-VTT-1kA | text-to-video R@1 | 52.3 | Side4Video |
| Video | MSR-VTT-1kA | text-to-video R@10 | 84.2 | Side4Video |
| Video | MSR-VTT-1kA | text-to-video R@5 | 75.5 | Side4Video |
| Video | VATEX | text-to-video MedianR | 2.7 | Side4Video |
| Video | VATEX | text-to-video R@1 | 68.8 | Side4Video |
| Video | VATEX | text-to-video R@10 | 97 | Side4Video |
| Video | VATEX | text-to-video R@5 | 93.5 | Side4Video |
| Video | VATEX | text-to-video R@50 | 1 | Side4Video |
| Video | MSVD | text-to-video Mean Rank | 8.4 | Side4Video |
| Video | MSVD | text-to-video Median Rank | 1 | Side4Video |
| Video | MSVD | text-to-video R@1 | 56.1 | Side4Video |
| Video | MSVD | text-to-video R@10 | 88.8 | Side4Video |
| Video | MSVD | text-to-video R@5 | 81.7 | Side4Video |
| Video | Kinetics-400 | Acc@1 | 88.6 | Side4Video (EVA, ViT-E/14) |
| Video | Kinetics-400 | Acc@5 | 98.2 | Side4Video (EVA, ViT-E/14) |
| Activity Recognition | Something-Something V1 | Top 1 Accuracy | 67.3 | Side4Video (EVA ViT-E/14) |
| Activity Recognition | Something-Something V1 | Top 5 Accuracy | 88.8 | Side4Video (EVA ViT-E/14) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 75.2 | Side4Video (EVA ViT-E/14) |
| Activity Recognition | Something-Something V2 | Top-5 Accuracy | 94 | Side4Video (EVA ViT-E/14) |
| Action Recognition | Something-Something V1 | Top 1 Accuracy | 67.3 | Side4Video (EVA ViT-E/14) |
| Action Recognition | Something-Something V1 | Top 5 Accuracy | 88.8 | Side4Video (EVA ViT-E/14) |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 75.2 | Side4Video (EVA ViT-E/14) |
| Action Recognition | Something-Something V2 | Top-5 Accuracy | 94 | Side4Video (EVA ViT-E/14) |
| Video Retrieval | MSR-VTT-1kA | text-to-video Mean Rank | 12.8 | Side4Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1 | Side4Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 52.3 | Side4Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 84.2 | Side4Video |
| Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 75.5 | Side4Video |
| Video Retrieval | VATEX | text-to-video MedianR | 2.7 | Side4Video |
| Video Retrieval | VATEX | text-to-video R@1 | 68.8 | Side4Video |
| Video Retrieval | VATEX | text-to-video R@10 | 97 | Side4Video |
| Video Retrieval | VATEX | text-to-video R@5 | 93.5 | Side4Video |
| Video Retrieval | VATEX | text-to-video R@50 | 1 | Side4Video |
| Video Retrieval | MSVD | text-to-video Mean Rank | 8.4 | Side4Video |
| Video Retrieval | MSVD | text-to-video Median Rank | 1 | Side4Video |
| Video Retrieval | MSVD | text-to-video R@1 | 56.1 | Side4Video |
| Video Retrieval | MSVD | text-to-video R@10 | 88.8 | Side4Video |
| Video Retrieval | MSVD | text-to-video R@5 | 81.7 | Side4Video |
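
For readers unfamiliar with the retrieval metrics listed above, the following is a minimal sketch of how text-to-video R@K, Median Rank, and Mean Rank are commonly computed from a text-video similarity matrix. It is not taken from the Side4Video code: the function names, the assumption that text i matches video i, the tie-breaking rule, and the toy similarity values are all illustrative.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumes sim[i][j] is the similarity between text i and video j,
    and that the ground-truth video for text i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the true one.
        ranks.append(1 + sum(score > row[i] for score in row))
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute R@K (in percent), Median Rank, and Mean Rank from ranks."""
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy 3x3 similarity matrix (made-up values, not results from the paper).
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
toy_ranks = ranks_from_similarity(toy_sim)
toy_metrics = retrieval_metrics(toy_ranks, ks=(1, 2))
print(toy_ranks)
print(toy_metrics)
```

Higher is better for R@K, while lower is better for Median Rank and Mean Rank, which is why a Median Rank of 1 indicates that the correct video is ranked first for at least half of the queries.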
