Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu, A S M Iftekhar, Gaurav Mittal, Tianjian Meng, Xiawei Wang, Cheng Zhao, Rohith Kukkala, Ehsan Elhamifar, Mei Chen

2025-05-22 · CVPR 2025
Tasks: Video Grounding · Temporal Sentence Grounding · Natural Language Queries · Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

Long Video Temporal Grounding (LVTG) aims at identifying specific moments within lengthy videos based on user-provided text queries for effective content retrieval. The approach taken by existing methods of dividing video into clips and processing each clip via a full-scale expert encoder is challenging to scale due to the prohibitive computational cost of processing a large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach employing a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on two LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods, establishing a new state-of-the-art for LVTG in terms of both efficiency and performance. Our code is available at https://github.com/ZijiaLewisLu/CVPR2025-DeCafNet.
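The core "delegate-and-conquer" idea in the abstract — score all clips cheaply with the sidekick encoder, then forward only the top-saliency fraction to the expensive expert encoder — can be illustrated with a minimal sketch. This is not the paper's code; the function name and scores are hypothetical, and `keep_ratio` simply mirrors the DeCafNet-50%/100% variants in the results (0.5 keeps half the clips, 1.0 keeps all):

```python
def select_clips_for_expert(clip_saliency, keep_ratio=0.5):
    """Pick the indices of the most salient clips for full expert processing.

    clip_saliency: per-clip relevance scores from a cheap "sidekick" pass.
    keep_ratio: fraction of clips delegated to the full-scale expert encoder.
    """
    n = len(clip_saliency)
    k = max(1, round(n * keep_ratio))  # always process at least one clip
    # Rank clips by saliency, highest first, and keep the top-k.
    order = sorted(range(n), key=lambda i: clip_saliency[i], reverse=True)
    # Return indices in temporal order for downstream processing.
    return sorted(order[:k])

# Hypothetical saliency scores for 8 clips of a long video:
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
print(select_clips_for_expert(scores, keep_ratio=0.5))  # → [1, 3, 5, 7]
```

The expert encoder then runs only on the returned indices, while the dense sidekick features for all clips are still available for fusion (in the paper, via DeCaf-Grounder).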

Results

Task                        | Dataset        | Metric               | Value | Model
----------------------------|----------------|----------------------|-------|----------------------
Temporal Sentence Grounding | Ego4D-Goalstep | R@1, IoU=0.3         | 23.2  | DeCafNet-100%
Temporal Sentence Grounding | Ego4D-Goalstep | R@1, IoU=0.5         | 19.4  | DeCafNet-100%
Temporal Sentence Grounding | Ego4D-Goalstep | R@5, IoU=0.3         | 51.38 | DeCafNet-100%
Temporal Sentence Grounding | Ego4D-Goalstep | R@5, IoU=0.5         | 44.17 | DeCafNet-100%
Temporal Sentence Grounding | Ego4D-Goalstep | R@1, IoU=0.3         | 21.29 | DeCafNet-50%
Temporal Sentence Grounding | Ego4D-Goalstep | R@1, IoU=0.5         | 17.46 | DeCafNet-50%
Temporal Sentence Grounding | Ego4D-Goalstep | R@5, IoU=0.3         | 47.27 | DeCafNet-50%
Temporal Sentence Grounding | Ego4D-Goalstep | R@5, IoU=0.5         | 40.4  | DeCafNet-50%
Temporal Sentence Grounding | Charades-STA   | R1@0.5               | 68.79 | DeCafNet
Temporal Sentence Grounding | Charades-STA   | R1@0.7               | 47.55 | DeCafNet
Temporal Sentence Grounding | Charades-STA   | R5@0.5               | 91.53 | DeCafNet
Temporal Sentence Grounding | Charades-STA   | R5@0.7               | 72.96 | DeCafNet
Video Grounding             | MAD            | R@1, IoU=0.1         | 13.25 | DeCafNet
Video Grounding             | MAD            | R@1, IoU=0.3         | 10.96 | DeCafNet
Video Grounding             | MAD            | R@1, IoU=0.5         | 7.06  | DeCafNet
Video Grounding             | MAD            | R@5, IoU=0.1         | 27.73 | DeCafNet
Video Grounding             | MAD            | R@5, IoU=0.3         | 23.68 | DeCafNet
Video Grounding             | MAD            | R@5, IoU=0.5         | 16.13 | DeCafNet
Video Grounding             | TACoS          | R@1, IoU=0.3         | 57.36 | DeCafNet
Video Grounding             | TACoS          | R@1, IoU=0.5         | 46.79 | DeCafNet
Video Grounding             | TACoS          | R@5, IoU=0.1         | 81.05 | DeCafNet
Video Grounding             | TACoS          | R@5, IoU=0.3         | 71.13 | DeCafNet
Natural Language Queries    | Ego4D          | R@1, IoU=0.3         | 22.21 | DeCafNet-100%
Natural Language Queries    | Ego4D          | R@1, IoU=0.5         | 15.52 | DeCafNet-100%
Natural Language Queries    | Ego4D          | R@1, mean(0.3, 0.5)  | 18.86 | DeCafNet-100%
Natural Language Queries    | Ego4D          | R@5, IoU=0.3         | 45.63 | DeCafNet-100%
Natural Language Queries    | Ego4D          | R@5, IoU=0.5         | 33.93 | DeCafNet-100%
Natural Language Queries    | Ego4D          | R@1, IoU=0.3         | 20.81 | DeCafNet-50%
Natural Language Queries    | Ego4D          | R@1, IoU=0.5         | 15.04 | DeCafNet-50%
Natural Language Queries    | Ego4D          | R@1, mean(0.3, 0.5)  | 17.93 | DeCafNet-50%
Natural Language Queries    | Ego4D          | R@5, IoU=0.3         | 42.4  | DeCafNet-50%
Natural Language Queries    | Ego4D          | R@5, IoU=0.5         | 31.68 | DeCafNet-50%
Natural Language Queries    | Ego4D          | R@1, IoU=0.3         | 18.1  | DeCafNet-50% (no NaQ)
Natural Language Queries    | Ego4D          | R@1, IoU=0.5         | 12.55 | DeCafNet-50% (no NaQ)
Natural Language Queries    | Ego4D          | R@1, mean(0.3, 0.5)  | 15.32 | DeCafNet-50% (no NaQ)
Natural Language Queries    | Ego4D          | R@5, IoU=0.3         | 38.85 | DeCafNet-50% (no NaQ)
Natural Language Queries    | Ego4D          | R@5, IoU=0.5         | 28.27 | DeCafNet-50% (no NaQ)
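The R@k, IoU=t metrics above are the standard temporal-grounding evaluation: a query counts as a hit if any of the model's top-k predicted moments overlaps the ground-truth moment with temporal IoU of at least t. A minimal sketch of that computation (not the benchmark's official evaluation code; the function names and toy moments are our own):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, gts, k=1, iou_thresh=0.5):
    """Percentage of queries whose top-k predictions contain a hit.

    ranked_preds: per-query lists of (start, end) moments, best first.
    gts: one ground-truth (start, end) moment per query.
    """
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return 100.0 * hits / len(gts)

# Toy example: two queries, two ranked predictions each (seconds).
preds = [[(10, 20), (40, 50)], [(0, 5), (30, 45)]]
gts = [(12, 22), (31, 44)]
print(recall_at_k(preds, gts, k=1, iou_thresh=0.5))  # → 50.0
print(recall_at_k(preds, gts, k=2, iou_thresh=0.5))  # → 100.0
```

Raising k (R@1 → R@5) or lowering the IoU threshold makes the criterion more forgiving, which is why R@5 numbers in the table are uniformly higher than R@1 at the same threshold.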

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding (2025-06-27)
A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs (2025-06-25)
Towards Probabilistic Question Answering Over Tabular Data (2025-06-25)
Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation (2025-06-12)
Improving Personalized Search with Regularized Low-Rank Parameter Updates (2025-06-11)
MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding (2025-06-10)
Technical Report for Argoverse2 Scenario Mining Challenges on Iterative Error Correction and Spatially-Aware Prompting (2025-06-10)