Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring

Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, Thomas H. Li

2023-01-26 · CVPR 2023
Tasks: Video Retrieval · Representation Learning · Video-Text Retrieval · Video Recognition · Text Retrieval · Retrieval
Paper · PDF · Code (official)

Abstract

Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both the high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over the state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN

Results

Task | Dataset | Metric | Value | Model
Video Retrieval | MSR-VTT-1kA | text-to-video R@1 | 54.1 | STAN
Video Retrieval | MSR-VTT-1kA | text-to-video R@5 | 79.5 | STAN
Video Retrieval | MSR-VTT-1kA | text-to-video R@10 | 87.8 | STAN
Video Retrieval | MSR-VTT-1kA | text-to-video Median Rank | 1 | STAN
Video Retrieval | DiDeMo | text-to-video R@1 | 54.6 | STAN
Video Retrieval | DiDeMo | text-to-video R@5 | 78.4 | STAN
Video Retrieval | DiDeMo | text-to-video R@10 | 85.1 | STAN
Video Retrieval | DiDeMo | text-to-video Median Rank | 1 | STAN
Video Retrieval | LSMDC | text-to-video R@1 | 29.2 | STAN
Video Retrieval | LSMDC | text-to-video R@5 | 49.5 | STAN
Video Retrieval | LSMDC | text-to-video R@10 | 58.8 | STAN
Video Retrieval | LSMDC | text-to-video Median Rank | 6 | STAN
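The R@K and Median Rank figures above are the standard text-to-video retrieval metrics: the percentage of queries whose ground-truth video ranks in the top K, and the median rank of the ground-truth video across queries. A minimal stdlib-only sketch of how they are computed from a similarity matrix (the toy matrix below is illustrative, not from the paper):

```python
import statistics

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video R@K (percent) and Median Rank.

    sim[i][j] scores text query i against video j; the ground-truth
    video for query i is assumed to be video i (ties broken by index).
    """
    ranks = []
    for i, row in enumerate(sim):
        # 1-indexed rank of the correct video when videos are sorted
        # by descending similarity to query i.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranks.append(order.index(i) + 1)
    recalls = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    return recalls, statistics.median(ranks)

# Toy 3x3 similarity matrix: queries 0 and 1 are ranked correctly,
# query 2's ground-truth video is ranked second.
sim = [[0.9, 0.1, 0.2],
       [0.3, 0.8, 0.1],
       [0.2, 0.9, 0.7]]
recalls, median_rank = retrieval_metrics(sim)
```

Higher R@K and lower Median Rank are better, which is why a Median Rank of 1 (as on MSR-VTT-1kA and DiDeMo above) indicates that for at least half the queries the correct video is ranked first.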

Related Papers

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper (2025-07-20)
Spectral Bellman Method: Unifying Representation and Exploration in RL (2025-07-17)
Boosting Team Modeling through Tempo-Relational Representation Learning (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)