Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Frozen CLIP Models are Efficient Video Learners

Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

2022-08-06 · Action Classification · Video Recognition
Paper · PDF · Code (official) · Code

Abstract

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
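The core mechanism the abstract describes — a learned query token that cross-attends over frozen per-frame CLIP features to pool a video-level representation — can be illustrated with a minimal numpy sketch. This is not the authors' implementation (see their repository for that); all shapes, names, and the single-head, single-layer simplification here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_pool(frame_feats, query, w_q, w_k, w_v):
    """Cross-attention pooling: one learned query attends over all frozen
    frame tokens (a toy stand-in for EVL's lightweight decoder).
    frame_feats: (T, N, D) frozen CLIP features for T frames of N tokens."""
    T, N, D = frame_feats.shape
    kv = frame_feats.reshape(T * N, D)      # flatten frames into one sequence
    q = query @ w_q                         # (1, D) projected learned query
    k = kv @ w_k                            # (T*N, D) keys from frozen features
    v = kv @ w_v                            # (T*N, D) values from frozen features
    attn = softmax(q @ k.T / np.sqrt(D))    # (1, T*N) weights over all tokens
    return attn @ v                         # (1, D) pooled video feature

# Toy sizes; real CLIP ViT-L/14 features have D=1024 and 32 frames in the paper.
rng = np.random.default_rng(0)
D = 16
feats = rng.standard_normal((8, 4, D))      # 8 frames, 4 spatial tokens each
query = rng.standard_normal((1, D))         # learned query token (trainable)
w_q, w_k, w_v = (rng.standard_normal((D, D)) for _ in range(3))
video_feat = query_pool(feats, query, w_q, w_k, w_v)
print(video_feat.shape)  # (1, 16)
```

Only the query and the projection matrices would be trained; the frame features stay frozen, which is what makes the approach cheap relative to end-to-end finetuning. The paper's local temporal module over adjacent frames is omitted here.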

Results

Task  | Dataset      | Metric | Value | Model
Video | Kinetics-400 | Acc@1  | 87.7  | EVL (CLIP ViT-L/14@336px, frozen, 32 frames)
Video | Kinetics-400 | Acc@5  | 97.8  | EVL (CLIP ViT-L/14@336px, frozen, 32 frames)

Related Papers

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis (2025-06-09)
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos (2025-06-05)
Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition (2025-05-29)
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding (2025-05-22)
Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes (2025-05-21)
Domain Adaptation of VLM for Soccer Video Understanding (2025-05-20)
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models (2025-05-13)