Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo

Published: 2023-03-11
Tasks: Text Generation · Video Captioning · Dense Video Captioning · Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To remove this dependency, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively mines the alignments between multi-sentence descriptions and their corresponding event segments. Instead of coarse-level video-language alignment, we present two dual pretext tasks that encourage fine-grained segment-level alignment: text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost that mitigates the sub-optimal matching caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also took 1st place in both the MTVG and MDVC tracks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.
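The TEG matching step described above (estimating cross-modal distances in a joint space, then assigning sentences to event proposals) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the embeddings are toy vectors, and plain cosine distance stands in for the paper's semantic-aware cost. The function name and dimensions are hypothetical.

```python
# Hypothetical sketch: sentences and event proposals are embedded in a joint
# semantic space, pairwise cosine distances form a cost matrix, and a
# one-to-one (Hungarian) assignment matches each sentence to one event.
# The actual GVL model uses a learned semantic-aware cost, not raw cosine.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_sentences_to_events(text_emb, event_emb):
    """Assign each sentence (row) to one event proposal (column)."""
    # Normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    cost = 1.0 - t @ e.T                       # cosine distance in joint space
    rows, cols = linear_sum_assignment(cost)   # minimal-cost bipartite matching
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 2 sentences, 3 candidate event proposals (4-d embeddings).
texts = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
events = np.array([[0.0, 0.9, 0.1, 0.0],    # close to sentence 1
                   [0.9, 0.0, 0.0, 0.1],    # close to sentence 0
                   [0.0, 0.0, 1.0, 0.0]])   # matches neither sentence
print(match_sentences_to_events(texts, events))  # [(0, 1), (1, 0)]
```

Because there are more proposals than sentences, the unmatched proposal (here, the third event) is simply left unassigned; the paper's semantic-aware cost further penalizes assignments that agree on boundaries but disagree semantically.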

Results

Task | Dataset | Metric | Value | Model
Video | TACoS | R@1,IoU=0.3 | 48.29 | GVL (paragraph-level)
Video | TACoS | R@1,IoU=0.5 | 36.07 | GVL (paragraph-level)
Video | TACoS | R@1,IoU=0.3 | 45.92 | GVL
Video | TACoS | R@1,IoU=0.5 | 34.57 | GVL
Video | ActivityNet Captions | R@1,IoU=0.5 | 60.67 | GVL (paragraph-level)
Video | ActivityNet Captions | R@1,IoU=0.7 | 38.55 | GVL (paragraph-level)
Video | ActivityNet Captions | R@1,IoU=0.5 | 49.18 | GVL
Video | ActivityNet Captions | R@1,IoU=0.7 | 29.69 | GVL
Video Captioning | YouCook2 | CIDEr | 26.52 | GVL
Video Captioning | YouCook2 | METEOR | 5.01 | GVL
Video Captioning | YouCook2 | SODA | 4.91 | GVL
Video Captioning | ActivityNet Captions | CIDEr | 33.33 | GVL
Video Captioning | ActivityNet Captions | METEOR | 10.03 | GVL
Video Captioning | ActivityNet Captions | SODA | 7.11 | GVL
Dense Video Captioning | YouCook2 | CIDEr | 26.52 | GVL
Dense Video Captioning | YouCook2 | METEOR | 5.01 | GVL
Dense Video Captioning | YouCook2 | SODA | 4.91 | GVL
Dense Video Captioning | ActivityNet Captions | CIDEr | 33.33 | GVL
Dense Video Captioning | ActivityNet Captions | METEOR | 10.03 | GVL
Dense Video Captioning | ActivityNet Captions | SODA | 7.11 | GVL

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (2025-07-15)
Seq vs Seq: An Open Suite of Paired Encoders and Decoders (2025-07-15)
Hashed Watermark as a Filter: Defeating Forging and Overwriting Attacks in Weight-based Neural Network Watermarking (2025-07-15)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
Exploiting Leaderboards for Large-Scale Distribution of Malicious Models (2025-07-11)
CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs (2025-07-09)