HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

2024-12-19
Tasks: Video Captioning, Dense Video Captioning

Abstract

With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves automatically localizing and captioning events in untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods that exploit prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of a human-oriented hierarchical compact memory, inspired by the human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by clustering memory events and summarizing them with large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves DVC performance, achieving state-of-the-art results on the YouCook2 and ViTT datasets.
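The coarse-to-fine recall described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes event memories are plain embedding vectors, uses a simple cosine-based k-means as the clustering step, and uses the cluster centroid as a stand-in for the LLM-written summary. The names `build_level` and `read_memory` are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_level(vectors, k, iters=10, seed=0):
    """Group event vectors into k clusters (assumed stand-in for the paper's clustering).

    Returns (centroids, members); members[c] lists indices into `vectors`.
    The centroid plays the role of a compact summary of its cluster.
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    members = [[] for _ in range(k)]
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            best = max(range(k), key=lambda c: cosine(v, centroids[c]))
            members[best].append(i)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean([vectors[i] for i in m]) if m else centroids[c]
                     for c, m in enumerate(members)]
    return centroids, members


def read_memory(query, centroids, members, vectors, top_k=2):
    """Coarse-to-fine read: pick the closest cluster, then rank its members."""
    c = max(range(len(centroids)), key=lambda j: cosine(query, centroids[j]))
    ranked = sorted(members[c], key=lambda i: cosine(query, vectors[i]), reverse=True)
    return c, ranked[:top_k]


# Toy event embeddings: two events near the x-axis, two near the y-axis.
events = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
centroids, members = build_level(events, k=2)
cluster, hits = read_memory([1.0, 0.0], centroids, members, events)
```

Reading the summary level first means the query is compared against `k` summaries plus the members of one cluster, rather than against every stored event.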

Results

Task	Dataset	Metric	Value	Model
Video Captioning	YouCook2	BLEU4	6.11	HiCM²
Video Captioning	YouCook2	CIDEr	71.84	HiCM²
Video Captioning	YouCook2	F1	32.51	HiCM²
Video Captioning	YouCook2	METEOR	12.8	HiCM²
Video Captioning	YouCook2	Precision	32.51	HiCM²
Video Captioning	YouCook2	Recall	32.51	HiCM²
Video Captioning	YouCook2	SODA	10.73	HiCM²
Video Captioning	ViTT	CIDEr	51.2	HiCM²
Video Captioning	ViTT	METEOR	9.6	HiCM²
Video Captioning	ViTT	SODA	0.15	HiCM²
Dense Video Captioning	YouCook2	BLEU4	6.11	HiCM²
Dense Video Captioning	YouCook2	CIDEr	71.84	HiCM²
Dense Video Captioning	YouCook2	F1	32.51	HiCM²
Dense Video Captioning	YouCook2	METEOR	12.8	HiCM²
Dense Video Captioning	YouCook2	Precision	32.51	HiCM²
Dense Video Captioning	YouCook2	Recall	32.51	HiCM²
Dense Video Captioning	YouCook2	SODA	10.73	HiCM²
Dense Video Captioning	ViTT	CIDEr	51.2	HiCM²
Dense Video Captioning	ViTT	METEOR	9.6	HiCM²
Dense Video Captioning	ViTT	SODA	0.15	HiCM²
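The F1 rows can be sanity-checked against the Precision and Recall rows. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the listing itself does not state the definition, and `f1_score` is a hypothetical helper name.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed for YouCook2 in the table above.
youcook2_f1 = f1_score(32.51, 32.51)
```

When precision and recall are equal, the harmonic mean equals that shared value, so the listed F1 of 32.51 is at least arithmetically consistent with the listed precision and recall.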
