Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

Published: 2024-04-11 · CVPR 2024
Tasks: Text Matching · Video Captioning · Dense Video Captioning · Retrieval
Paper · PDF · Code (official)

Abstract

There has been significant attention to research on dense video captioning, which aims to automatically localize and caption all events within untrimmed videos. Several studies formulate dense video captioning as a multi-task problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study, we address this by proposing a novel framework inspired by human cognitive information processing. Our model utilizes external memory to incorporate prior knowledge, and we propose a memory retrieval method based on cross-modal video-to-text matching. To effectively incorporate the retrieved text features, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets demonstrate the effectiveness of the proposed method. Experimental results show promising performance of our model without extensive pretraining on a large video dataset.
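The retrieval step described in the abstract (cross-modal video-to-text matching against an external memory) can be sketched as a nearest-neighbor lookup over precomputed text embeddings. This is a minimal illustration only: the function and array names below are hypothetical and do not come from the paper's released code, which uses learned encoders rather than random features.

```python
import numpy as np

def retrieve_memory(video_feat, memory_keys, memory_values, k=3):
    """Retrieve the top-k text features from an external memory bank,
    ranked by cosine similarity between a video query and the text keys."""
    q = video_feat / np.linalg.norm(video_feat)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                # cosine similarity per memory entry
    topk = np.argsort(-sims)[:k]  # indices of the k most similar entries
    return memory_values[topk], sims[topk]

# Stand-in features (in the paper these come from trained encoders).
rng = np.random.default_rng(0)
memory_keys = rng.normal(size=(100, 64))    # text embeddings used for matching
memory_values = rng.normal(size=(100, 64))  # text features fed to the decoder
video_feat = rng.normal(size=64)            # pooled visual feature (query)

vals, sims = retrieve_memory(video_feat, memory_keys, memory_values, k=3)
print(vals.shape, sims.shape)
```

The retrieved `vals` would then be consumed by the textual cross-attention branch of the decoder, alongside the visual features in the visual branch.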

Results

Task                   | Dataset              | Metric    | Value | Model
Video Captioning       | YouCook2             | BLEU-4    |  1.63 | CM²
Video Captioning       | YouCook2             | CIDEr     | 31.66 | CM²
Video Captioning       | YouCook2             | F1        | 28.43 | CM²
Video Captioning       | YouCook2             | METEOR    |  6.08 | CM²
Video Captioning       | YouCook2             | Precision | 33.38 | CM²
Video Captioning       | YouCook2             | Recall    | 24.76 | CM²
Video Captioning       | YouCook2             | SODA      |  5.34 | CM²
Video Captioning       | ActivityNet Captions | BLEU-4    |  2.38 | CM²
Video Captioning       | ActivityNet Captions | CIDEr     | 33.01 | CM²
Video Captioning       | ActivityNet Captions | F1        | 55.21 | CM²
Video Captioning       | ActivityNet Captions | METEOR    |  8.55 | CM²
Video Captioning       | ActivityNet Captions | Precision | 56.81 | CM²
Video Captioning       | ActivityNet Captions | Recall    | 53.71 | CM²
Video Captioning       | ActivityNet Captions | SODA      |  6.18 | CM²
Dense Video Captioning | YouCook2             | BLEU-4    |  1.63 | CM²
Dense Video Captioning | YouCook2             | CIDEr     | 31.66 | CM²
Dense Video Captioning | YouCook2             | F1        | 28.43 | CM²
Dense Video Captioning | YouCook2             | METEOR    |  6.08 | CM²
Dense Video Captioning | YouCook2             | Precision | 33.38 | CM²
Dense Video Captioning | YouCook2             | Recall    | 24.76 | CM²
Dense Video Captioning | YouCook2             | SODA      |  5.34 | CM²
Dense Video Captioning | ActivityNet Captions | BLEU-4    |  2.38 | CM²
Dense Video Captioning | ActivityNet Captions | CIDEr     | 33.01 | CM²
Dense Video Captioning | ActivityNet Captions | F1        | 55.21 | CM²
Dense Video Captioning | ActivityNet Captions | METEOR    |  8.55 | CM²
Dense Video Captioning | ActivityNet Captions | Precision | 56.81 | CM²
Dense Video Captioning | ActivityNet Captions | Recall    | 53.71 | CM²
Dense Video Captioning | ActivityNet Captions | SODA      |  6.18 | CM²

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)
Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker (2025-07-16)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Context-Aware Search and Retrieval Over Erasure Channels (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)