
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

Bin Wang, Wenqian Wang

2024-08-20 · Transfer Learning · Parameter-Efficient Fine-Tuning · Action Recognition · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representational capabilities. This has inspired researchers to transfer knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to improve the efficiency of parameter-efficient fine-tuning (PEFT). However, current transfer approaches in VAR tend to directly transfer frozen knowledge from large pre-trained models to action recognition networks at minimal cost, rather than exploiting the temporal modeling capabilities of the action recognition models themselves. In this paper, we therefore propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) that balances knowledge transfer and temporal modeling while avoiding backpropagation through the frozen model's parameters. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which effectively captures local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) that guides the proposed side network to efficiently learn the rich motion information in videos, thereby improving the side network's ability to capture and learn motion cues. Extensive experiments are conducted on three benchmark datasets: Something-Something V1 & V2 and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
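
The abstract's two adapters can be made more concrete with a rough sketch. The snippet below is a minimal, hypothetical PyTorch illustration of the general idea only: frame-difference features passed through a bottleneck adapter inside a trainable side network that reads detached features from the frozen backbone. The module names, tensor shapes, bottleneck size, and fusion details are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TDAdapter(nn.Module):
    """Toy temporal-difference adapter: injects the difference between
    neighbouring frames through a small bottleneck, as a residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim) per-frame token features.
        diff = torch.zeros_like(x)
        diff[:, 1:] = x[:, 1:] - x[:, :-1]             # local temporal difference
        return x + self.up(self.act(self.down(diff)))  # residual bottleneck


def motion_cue(frames: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the SME-Adapter's input: raw frame differences that
    emphasise motion, supplied to the side network as an extra signal."""
    cue = torch.zeros_like(frames)
    cue[:, 1:] = frames[:, 1:] - frames[:, :-1]        # (batch, time, C, H, W)
    return cue


class SideBlock(nn.Module):
    """One trainable side-network block fed by a frozen backbone layer.
    Only the side path receives gradients; the backbone feature is detached."""

    def __init__(self, dim: int):
        super().__init__()
        self.td_adapter = TDAdapter(dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, side: torch.Tensor, frozen_feat: torch.Tensor) -> torch.Tensor:
        side = side + frozen_feat.detach()   # no backpropagation into frozen CLIP
        side = self.td_adapter(side)         # strengthen temporal modeling
        return side + self.mlp(self.norm(side))
```

In a full model of this kind, one such SideBlock would typically shadow each frozen CLIP layer, with motion_cue-style differences injected at the side network's input, and only the side-network parameters would be updated during fine-tuning; the exact placement and fusion used by TDS-CLIP should be taken from the official code.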

Results

Task                 | Dataset                | Metric         | Value | Model
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 63    | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V1 | Top-5 Accuracy | 87.8  | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 73.4  | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 93.8  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V1 | Top-1 Accuracy | 63    | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V1 | Top-5 Accuracy | 87.8  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 73.4  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 93.8  | TDS-CLIP-ViT-L/14 (8 frames)

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)