Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo

Published: 2023-03-11
Tasks: Text Generation · Video Captioning · Dense Video Captioning · Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and their corresponding event segments. Instead of coarse-level video-language alignment, we present two dual pretext tasks that encourage fine-grained segment-level alignment: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also took 1st place in both the MTVG and MDVC tracks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
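The TEG matching step described above (estimating cross-modal distances in a joint space, then assigning sentences to event proposals) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the embeddings are toy vectors, and plain cosine distance stands in for the paper's semantic-aware cost. The function name and dimensions are hypothetical.

```python
# Hypothetical sketch: sentences and event proposals are embedded in a joint
# semantic space, pairwise cosine distances form a cost matrix, and a
# one-to-one (Hungarian) assignment matches each sentence to one event.
# The actual GVL model uses a learned semantic-aware cost, not raw cosine.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_sentences_to_events(text_emb, event_emb):
    """Assign each sentence (row) to one event proposal (column)."""
    # Normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    cost = 1.0 - t @ e.T                       # cosine distance in joint space
    rows, cols = linear_sum_assignment(cost)   # minimal-cost bipartite matching
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 2 sentences, 3 candidate event proposals (4-d embeddings).
texts = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
events = np.array([[0.0, 0.9, 0.1, 0.0],    # close to sentence 1
                   [0.9, 0.0, 0.0, 0.1],    # close to sentence 0
                   [0.0, 0.0, 1.0, 0.0]])   # matches neither sentence
print(match_sentences_to_events(texts, events))  # [(0, 1), (1, 0)]
```

Because there are more proposals than sentences, the unmatched proposal (here, the third event) is simply left unassigned; the paper's semantic-aware cost further penalizes assignments that agree on boundaries but disagree semantically.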

Results

Task | Dataset | Metric | Value | Model
Video | TACoS | R@1,IoU=0.3 | 48.29 | GVL (paragraph-level)
Video | TACoS | R@1,IoU=0.5 | 36.07 | GVL (paragraph-level)
Video | TACoS | R@1,IoU=0.3 | 45.92 | GVL
Video | TACoS | R@1,IoU=0.5 | 34.57 | GVL
Video | ActivityNet Captions | R@1,IoU=0.5 | 60.67 | GVL (paragraph-level)
Video | ActivityNet Captions | R@1,IoU=0.7 | 38.55 | GVL (paragraph-level)
Video | ActivityNet Captions | R@1,IoU=0.5 | 49.18 | GVL
Video | ActivityNet Captions | R@1,IoU=0.7 | 29.69 | GVL
Video Captioning | YouCook2 | CIDEr | 26.52 | GVL
Video Captioning | YouCook2 | METEOR | 5.01 | GVL
Video Captioning | YouCook2 | SODA | 4.91 | GVL
Video Captioning | ActivityNet Captions | CIDEr | 33.33 | GVL
Video Captioning | ActivityNet Captions | METEOR | 10.03 | GVL
Video Captioning | ActivityNet Captions | SODA | 7.11 | GVL
Dense Video Captioning | YouCook2 | CIDEr | 26.52 | GVL
Dense Video Captioning | YouCook2 | METEOR | 5.01 | GVL
Dense Video Captioning | YouCook2 | SODA | 4.91 | GVL
Dense Video Captioning | ActivityNet Captions | CIDEr | 33.33 | GVL
Dense Video Captioning | ActivityNet Captions | METEOR | 10.03 | GVL
Dense Video Captioning | ActivityNet Captions | SODA | 7.11 | GVL

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Exploiting Leaderboards for Large-Scale Distribution of Malicious Models (2025-07-11)
CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs (2025-07-09)