
TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

Bin Wang, Wenqian Wang

2024-08-20 · Transfer Learning · Parameter-Efficient Fine-Tuning · Action Recognition · Temporal Action Localization

Paper · PDF · Code (official)

Abstract

Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representational capabilities. This has inspired researchers to transfer knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to improve the efficiency of parameter-efficient fine-tuning (PEFT). However, current transfer approaches in VAR tend to directly transfer frozen knowledge from large pre-trained models to action recognition networks at minimal cost, rather than exploiting the temporal modeling capabilities of the action recognition models themselves. In this paper, we therefore propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) that balances knowledge transfer and temporal modeling while avoiding backpropagation through the frozen model's parameters. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which effectively captures local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we design a Side Motion Enhancement Adapter (SME-Adapter) that guides the proposed side network to efficiently learn the rich motion information in videos, thereby improving the side network's ability to capture and learn motion cues. Extensive experiments are conducted on three benchmark datasets: Something-Something V1 & V2 and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
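
The abstract's two adapters can be made more concrete with a rough sketch. The snippet below is a minimal, hypothetical PyTorch illustration of the general idea only: frame-difference features passed through a bottleneck adapter inside a trainable side network that reads detached features from the frozen backbone. The module names, tensor shapes, bottleneck size, and fusion details are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TDAdapter(nn.Module):
    """Toy temporal-difference adapter: injects the difference between
    neighbouring frames through a small bottleneck, as a residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim) per-frame token features.
        diff = torch.zeros_like(x)
        diff[:, 1:] = x[:, 1:] - x[:, :-1]             # local temporal difference
        return x + self.up(self.act(self.down(diff)))  # residual bottleneck


def motion_cue(frames: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the SME-Adapter's input: raw frame differences that
    emphasise motion, supplied to the side network as an extra signal."""
    cue = torch.zeros_like(frames)
    cue[:, 1:] = frames[:, 1:] - frames[:, :-1]        # (batch, time, C, H, W)
    return cue


class SideBlock(nn.Module):
    """One trainable side-network block fed by a frozen backbone layer.
    Only the side path receives gradients; the backbone feature is detached."""

    def __init__(self, dim: int):
        super().__init__()
        self.td_adapter = TDAdapter(dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, side: torch.Tensor, frozen_feat: torch.Tensor) -> torch.Tensor:
        side = side + frozen_feat.detach()   # no backpropagation into frozen CLIP
        side = self.td_adapter(side)         # strengthen temporal modeling
        return side + self.mlp(self.norm(side))
```

In a full model of this kind, one such SideBlock would typically shadow each frozen CLIP layer, with motion_cue-style differences injected at the side network's input, and only the side-network parameters would be updated during fine-tuning; the exact placement and fusion used by TDS-CLIP should be taken from the official code.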

Results

Task                 | Dataset                | Metric         | Value | Model
Activity Recognition | Something-Something V1 | Top-1 Accuracy | 63    | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V1 | Top-5 Accuracy | 87.8  | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 73.4  | TDS-CLIP-ViT-L/14 (8 frames)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 93.8  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V1 | Top-1 Accuracy | 63    | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V1 | Top-5 Accuracy | 87.8  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 73.4  | TDS-CLIP-ViT-L/14 (8 frames)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 93.8  | TDS-CLIP-ViT-L/14 (8 frames)

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift (2025-07-12)