Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen

2024-03-31 · Transfer Learning · Video Summarization · Highlight Detection · Moment Retrieval · Video Understanding · Natural Language Queries
Paper · PDF · Code (official)

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^2$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, the $R^2$ Block recurrently aggregates spatial features from earlier layers, then refines temporal correlation conditioned on the given query, resulting in a coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance across three VTG tasks (i.e., moment retrieval, highlight detection, and video summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional backbone, demonstrating the significance and effectiveness of the proposed scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.
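
To make the coarse-to-fine scheme concrete, below is a minimal PyTorch sketch of the reversed recurrent idea described in the abstract: a single lightweight block starts from the last CLIP layer and recurrently folds in frame features from earlier layers, refining temporal correlation conditioned on the query. All class names, dimensions, and the attention-based fusion here are illustrative assumptions, not the paper's exact design; see the official code at https://github.com/yeliudev/R2-Tuning for the real implementation.

import torch
import torch.nn as nn

class R2BlockSketch(nn.Module):
    """One shared lightweight block, applied recurrently over frozen CLIP layers
    (a sketch; the actual R^2 Block design differs)."""
    def __init__(self, clip_dim: int = 768, hidden_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, hidden_dim)   # compress frozen CLIP features
        self.temporal = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, layer_feat, state, query_emb):
        # layer_feat: (B, T, clip_dim) frame features from one CLIP layer
        # state:      (B, T, hidden_dim) running coarse-to-fine representation
        # query_emb:  (B, 1, hidden_dim) pooled text-query embedding (hypothetical)
        x = state + self.proj(layer_feat)             # aggregate spatial features
        # refine temporal correlation across frames, conditioned on the query
        attn_out, _ = self.temporal(x + query_emb, x, x)
        return self.norm(x + attn_out)

def reversed_recurrent_aggregate(clip_layer_feats, query_emb, block):
    """clip_layer_feats: list of (B, T, clip_dim) tensors, ordered first -> last layer."""
    B, T, _ = clip_layer_feats[0].shape
    state = torch.zeros(B, T, block.norm.normalized_shape[0])
    # start from the last (most semantic) CLIP layer, recur toward earlier ones
    for feat in reversed(clip_layer_feats):
        state = block(feat, state, query_emb)
    return state  # fed to task heads (moment retrieval / highlight detection / summarization)

# Usage with dummy shapes: 12 CLIP layers, batch of 2 videos, 32 frames each
block = R2BlockSketch()
feats = [torch.randn(2, 32, 768) for _ in range(12)]
query = torch.randn(2, 1, 256)
out = reversed_recurrent_aggregate(feats, query, block)  # (2, 32, 256)

Because one small block is shared across all layers while CLIP itself stays frozen, the trainable parameter count remains a small fraction of the model, which is the efficiency argument behind the 1.5% figure in the abstract.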

Results

Task                  Dataset        Metric         Value   Model
Moment Retrieval      QVHighlights   R@1 IoU=0.5    68.03   R^2-Tuning
Moment Retrieval      QVHighlights   R@1 IoU=0.7    49.35   R^2-Tuning
Moment Retrieval      QVHighlights   mAP            46.17   R^2-Tuning
Moment Retrieval      QVHighlights   mAP@0.5        69.04   R^2-Tuning
Moment Retrieval      QVHighlights   mAP@0.75       47.56   R^2-Tuning
Highlight Detection   QVHighlights   Hit@1          64.2    R^2-Tuning
Highlight Detection   QVHighlights   mAP            40.75   R^2-Tuning

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
Robust-Multi-Task Gradient Boosting (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025-07-14)
Chat with AI: The Surprising Turn of Real-time Video Communication from Human to AI (2025-07-14)