Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li

2024-07-21 · Video Grounding · Highlight Detection · Moment Retrieval · Retrieval · General Knowledge · Natural Language Moment Retrieval
Paper · PDF · Code (official)

Abstract

In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at https://github.com/fletcherjiang/LLMEPET.
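
The encoder-as-refiner idea above can be pictured with a short sketch: multimodal frame and query embeddings are projected into the width of a frozen LLM encoder, passed through it once, and projected back for the downstream moment retrieval head. Everything below (the class name, dimensions, and the stand-in transformer encoder) is an illustrative assumption, not code from the official LLMEPET repository linked above.

```python
# Hypothetical sketch of a fusion module wrapping a frozen "LLM encoder".
# A generic nn.TransformerEncoder stands in for a pretrained LLM encoder,
# which in practice would be loaded from a checkpoint and kept frozen.
import torch
import torch.nn as nn

class LLMEncoderFusion(nn.Module):  # name is illustrative, not from the repo
    def __init__(self, video_dim=512, text_dim=512, llm_dim=768, n_layers=2):
        super().__init__()
        # Light, trainable projections into the LLM encoder's width.
        self.video_proj = nn.Linear(video_dim, llm_dim)
        self.text_proj = nn.Linear(text_dim, llm_dim)
        # Stand-in for the pretrained LLM encoder block.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                           batch_first=True)
        self.llm_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.llm_encoder.parameters():
            p.requires_grad = False  # frozen; only the projections train
        self.out_proj = nn.Linear(llm_dim, video_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim) frame embeddings (e.g., CLIP)
        # text_feats:  (B, L, text_dim) query token embeddings
        fused = torch.cat([self.video_proj(video_feats),
                           self.text_proj(text_feats)], dim=1)
        refined = self.llm_encoder(fused)     # refine inter-concept relations
        T = video_feats.size(1)
        return self.out_proj(refined[:, :T])  # refined per-frame features

# Toy usage with random features in place of real CLIP embeddings.
fusion = LLMEncoderFusion()
video = torch.randn(2, 75, 512)    # 2 clips, 75 frames each
query = torch.randn(2, 12, 512)    # 12 query tokens
print(fusion(video, query).shape)  # torch.Size([2, 75, 512])
```

The design choice mirrored here follows the abstract's claim that the refinement works without training the LLM on the target embeddings: the LLM block stays frozen and only the thin projection layers learn alongside the rest of the VMR model.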

Results

Task                 Dataset             Metric          Value   Model
Video                QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video                QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Video                TACoS               R@1, IoU=0.3    52.73   LLMEPET
Video                TACoS               R@1, IoU=0.5    40.12   LLMEPET
Video                TACoS               R@1, IoU=0.7    22.78   LLMEPET
Video                TACoS               mIoU            36.55   LLMEPET
Video Retrieval      QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video Retrieval      QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Moment Retrieval     Charades-STA        R@1, IoU=0.5    58.31   LLMEPET
Moment Retrieval     Charades-STA        R@1, IoU=0.7    36.49   LLMEPET
Moment Retrieval     QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Moment Retrieval     QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Moment Retrieval     QVHighlights        mAP             44.05   LLMEPET
Moment Retrieval     QVHighlights        mAP@0.5         65.76   LLMEPET
Moment Retrieval     QVHighlights        mAP@0.75        43.91   LLMEPET
Highlight Detection  YouTube Highlights  mAP             75.3    LLMEPET
Highlight Detection  QVHighlights        Hit@1           65.69   LLMEPET
Highlight Detection  QVHighlights        mAP             40.33   LLMEPET
Video Grounding      QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video Grounding      QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
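
For reference, the R@1, IoU=θ entries above follow the standard moment retrieval convention: the top-ranked predicted span counts as correct when its temporal IoU with the ground-truth span is at least θ, averaged over queries. A minimal illustration of that convention (not code from the paper):

```python
# Standard R@1, IoU=theta metric used in the table above: the top-1
# predicted moment is a hit if its temporal IoU with the ground truth
# reaches the threshold. Generic definition, not from the LLMEPET repo.
def temporal_iou(pred, gt):
    """pred, gt: (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.5):
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Toy example: one hit at IoU >= 0.5, one miss.
print(recall_at_1([(10, 20), (0, 5)], [(12, 22), (30, 40)]))  # 50.0
```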

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)