
VLG-Net: Video-Language Graph Matching Network for Video Grounding

Mattia Soldan, Mengmeng Xu, Sisi Qu, Jesper Tegner, Bernard Ghanem

2020-11-19 | Video Grounding | Moment Retrieval | Temporal Localization | Graph Matching | Natural Language Moment Retrieval

Abstract

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. Solving this challenging task demands understanding the semantic content of both videos and queries, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge as an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients are representation graphs, built separately atop video snippets and query tokens, that model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created with masked moment attention pooling, which fuses each moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
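For readers who want a concrete picture of the pipeline the abstract describes, here is a minimal PyTorch sketch (not the authors' released code) of its three ingredients: graph convolutions over intra-modality graphs for video snippets and query tokens, a cross-modal attention step standing in for the Graph Matching layer, and masked attention pooling over a candidate moment. The chain-graph adjacency, single-layer modules, feature sizes, and one-directional matching are simplifying assumptions made for brevity.

```python
# Minimal sketch of the ideas described in the abstract (not the authors' code):
# intra-modality GCNs over video snippets and query tokens, a cross-modal
# attention step standing in for graph matching, and masked attention pooling
# over a candidate moment. All sizes and graph structures are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def chain_adjacency(n: int) -> torch.Tensor:
    """Row-normalised adjacency of a simple chain graph with self-loops;
    a stand-in for the paper's richer representation graphs."""
    a = torch.eye(n)
    idx = torch.arange(n - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    return a / a.sum(dim=-1, keepdim=True)


class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbour features, then project."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return F.relu(self.proj(adj @ x))


class CrossModalMatching(nn.Module):
    """Attention from video nodes to query nodes (one direction only, for brevity)."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        attn = torch.softmax(self.q(video) @ self.k(query).t() / video.size(-1) ** 0.5, dim=-1)
        return video + attn @ self.v(query)


def masked_moment_pooling(snippets: torch.Tensor, start: int, end: int,
                          scorer: nn.Linear) -> torch.Tensor:
    """Attention-pool snippet features inside [start, end), masking out
    snippets that fall outside the candidate moment."""
    scores = scorer(snippets).squeeze(-1)              # (T,)
    mask = torch.full_like(scores, float("-inf"))
    mask[start:end] = 0.0
    weights = torch.softmax(scores + mask, dim=-1)     # zero weight outside the moment
    return weights @ snippets                          # (dim,)


if __name__ == "__main__":
    dim, T, L = 64, 20, 8                              # feature size, #snippets, #tokens
    video = torch.randn(T, dim)                        # video snippet features
    query = torch.randn(L, dim)                        # query token features

    video = GCNLayer(dim)(video, chain_adjacency(T))   # intra-modality video graph
    query = GCNLayer(dim)(query, chain_adjacency(L))   # intra-modality query graph

    video = CrossModalMatching(dim)(video, query)      # cross-modal fusion
    moment = masked_moment_pooling(video, 5, 12, nn.Linear(dim, 1))
    print(moment.shape)                                # torch.Size([64])
```

The full VLG-Net uses richer graph construction and fusion than these stand-ins; the sketch only mirrors the overall data flow from snippet and token features to a pooled moment representation.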

Results

Task | Dataset | Metric | Value | Model
Video Grounding | TACoS | R@1, IoU=0.3 | 45.46 | VLG-Net
Video Grounding | TACoS | R@1, IoU=0.5 | 34.19 | VLG-Net
Video Grounding | TACoS | R@5, IoU=0.1 | 81.8 | VLG-Net
Video Grounding | TACoS | R@5, IoU=0.3 | 70.38 | VLG-Net
Video Grounding | TACoS | R@5, IoU=0.5 | 56.56 | VLG-Net
Video Grounding | ActivityNet Captions | R@1, IoU=0.5 | 46.32 | VLG-Net
Video Grounding | ActivityNet Captions | R@1, IoU=0.7 | 29.82 | VLG-Net
Video Grounding | ActivityNet Captions | R@5, IoU=0.5 | 77.15 | VLG-Net
Video Grounding | ActivityNet Captions | R@5, IoU=0.7 | 63.33 | VLG-Net
Video Grounding | DiDeMo | R@1, IoU=0.5 | 33.35 | VLG-Net
Video Grounding | DiDeMo | R@1, IoU=0.7 | 25.57 | VLG-Net
Video Grounding | DiDeMo | R@1, IoU=1.0 | 25.57 | VLG-Net
Video Grounding | DiDeMo | R@5, IoU=0.5 | 88.86 | VLG-Net
Video Grounding | DiDeMo | R@5, IoU=0.7 | 71.72 | VLG-Net
Video Grounding | DiDeMo | R@5, IoU=1.0 | 71.65 | VLG-Net
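For reference, the R@k, IoU=m metric reported above counts a query as correctly grounded when at least one of the model's top-k retrieved moments overlaps the ground-truth moment with temporal IoU of at least m; the reported value is the percentage of queries for which this holds. A small illustration follows (a sketch of the metric, not the official evaluation script; the example moments are made up).

```python
# Sketch of how R@k, IoU=m is computed for temporal grounding: a query is a
# hit if any of its top-k predicted moments overlaps the ground-truth moment
# with temporal IoU >= m. Example data below is illustrative only.
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions: List[List[Moment]], ground_truth: List[Moment],
                k: int, iou_threshold: float) -> float:
    """Percentage of queries whose top-k predictions contain at least one
    moment with IoU >= iou_threshold against the ground-truth moment."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(predictions, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)


if __name__ == "__main__":
    preds = [[(4.0, 10.0), (0.0, 3.0)], [(20.0, 25.0)]]    # ranked moments per query
    gt = [(5.0, 11.0), (0.0, 5.0)]
    print(recall_at_k(preds, gt, k=1, iou_threshold=0.5))  # 50.0
```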

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding (2025-06-16)
Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements (2025-06-11)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing (2025-06-05)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency (2025-06-02)
Probing Neural Topology of Large Language Models (2025-06-01)
PackHero: A Scalable Graph-based Approach for Efficient Packer Identification (2025-05-31)
DisTime: Distribution-based Time Representation for Video Large Language Models (2025-05-30)