Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval

Yiyang Jiang, Wengyu Zhang, Xulu Zhang, Xiaoyong Wei, Chang Wen Chen, Qing Li

2024-07-21 · Video Grounding · Highlight Detection · Moment Retrieval · Retrieval · General Knowledge · Natural Language Moment Retrieval
Paper · PDF · Code (official)

Abstract

In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embeddings. We present a general framework for integrating LLM encoders into existing VMR architectures, specifically within the fusion module. Through experimental validation, we demonstrate the effectiveness of our proposed methods by achieving state-of-the-art performance in VMR. The source code can be accessed at https://github.com/fletcherjiang/LLMEPET.
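
The encoder-as-refiner idea above can be pictured with a short sketch: multimodal frame and query embeddings are projected into the width of a frozen LLM encoder, passed through it once, and projected back for the downstream moment retrieval head. Everything below (the class name, dimensions, and the stand-in transformer encoder) is an illustrative assumption, not code from the official LLMEPET repository linked above.

```python
# Hypothetical sketch of a fusion module wrapping a frozen "LLM encoder".
# A generic nn.TransformerEncoder stands in for a pretrained LLM encoder,
# which in practice would be loaded from a checkpoint and kept frozen.
import torch
import torch.nn as nn

class LLMEncoderFusion(nn.Module):  # name is illustrative, not from the repo
    def __init__(self, video_dim=512, text_dim=512, llm_dim=768, n_layers=2):
        super().__init__()
        # Light, trainable projections into the LLM encoder's width.
        self.video_proj = nn.Linear(video_dim, llm_dim)
        self.text_proj = nn.Linear(text_dim, llm_dim)
        # Stand-in for the pretrained LLM encoder block.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                           batch_first=True)
        self.llm_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        for p in self.llm_encoder.parameters():
            p.requires_grad = False  # frozen; only the projections train
        self.out_proj = nn.Linear(llm_dim, video_dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, video_dim) frame embeddings (e.g., CLIP)
        # text_feats:  (B, L, text_dim) query token embeddings
        fused = torch.cat([self.video_proj(video_feats),
                           self.text_proj(text_feats)], dim=1)
        refined = self.llm_encoder(fused)     # refine inter-concept relations
        T = video_feats.size(1)
        return self.out_proj(refined[:, :T])  # refined per-frame features

# Toy usage with random features in place of real CLIP embeddings.
fusion = LLMEncoderFusion()
video = torch.randn(2, 75, 512)    # 2 clips, 75 frames each
query = torch.randn(2, 12, 512)    # 12 query tokens
print(fusion(video, query).shape)  # torch.Size([2, 75, 512])
```

The design choice mirrored here follows the abstract's claim that the refinement works without training the LLM on the target embeddings: the LLM block stays frozen and only the thin projection layers learn alongside the rest of the VMR model.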

Results

Task                 Dataset             Metric          Value   Model
Video                QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video                QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Video                TACoS               R@1, IoU=0.3    52.73   LLMEPET
Video                TACoS               R@1, IoU=0.5    40.12   LLMEPET
Video                TACoS               R@1, IoU=0.7    22.78   LLMEPET
Video                TACoS               mIoU            36.55   LLMEPET
Video Retrieval      QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video Retrieval      QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Moment Retrieval     Charades-STA        R@1, IoU=0.5    58.31   LLMEPET
Moment Retrieval     Charades-STA        R@1, IoU=0.7    36.49   LLMEPET
Moment Retrieval     QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Moment Retrieval     QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
Moment Retrieval     QVHighlights        mAP             44.05   LLMEPET
Moment Retrieval     QVHighlights        mAP@0.5         65.76   LLMEPET
Moment Retrieval     QVHighlights        mAP@0.75        43.91   LLMEPET
Highlight Detection  YouTube Highlights  mAP             75.3    LLMEPET
Highlight Detection  QVHighlights        Hit@1           65.69   LLMEPET
Highlight Detection  QVHighlights        mAP             40.33   LLMEPET
Video Grounding      QVHighlights        R@1, IoU=0.5    66.73   LLMEPET
Video Grounding      QVHighlights        R@1, IoU=0.7    49.94   LLMEPET
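
For reference, the R@1, IoU=θ entries above follow the standard moment retrieval convention: the top-ranked predicted span counts as correct when its temporal IoU with the ground-truth span is at least θ, averaged over queries. A minimal illustration of that convention (not code from the paper):

```python
# Standard R@1, IoU=theta metric used in the table above: the top-1
# predicted moment is a hit if its temporal IoU with the ground truth
# reaches the threshold. Generic definition, not from the LLMEPET repo.
def temporal_iou(pred, gt):
    """pred, gt: (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.5):
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Toy example: one hit at IoU >= 0.5, one miss.
print(recall_at_1([(10, 20), (0, 5)], [(12, 22), (30, 40)]))  # 50.0
```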

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)