Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Jie Lei, Li-Wei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal

2020-05-11 · ACL 2020 · Video Captioning

Paper | PDF | Code (official)

Abstract

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets ActivityNet Captions and YouCookII show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is available open-source at: https://github.com/jayleicn/recurrent-transformer
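The core idea in the abstract, a memory state that summarizes the video segments and sentence history seen so far and is carried across decoding steps, can be illustrated with a toy recurrence. This is a minimal pure-Python sketch, not the paper's implementation: the real MART memory updater uses learned projection matrices and multi-head attention, whereas here each memory slot attends over the current segment's hidden states with plain dot products and a hypothetical scalar gate stands in for the learned update gates.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update_memory(memory, segment_states):
    """One recurrence step of a MART-style memory update (toy version).

    memory:         list of d-dim memory slot vectors from step t-1
    segment_states: list of d-dim hidden states for the current video
                    segment and its sentence

    Each slot attends over the segment states to form a candidate
    summary, then a scalar gate mixes the old slot with the candidate.
    """
    new_memory = []
    for slot in memory:
        # attention of this slot over the current segment's hidden states
        weights = softmax([dot(slot, h) for h in segment_states])
        candidate = [sum(w * h[i] for w, h in zip(weights, segment_states))
                     for i in range(len(slot))]
        # the paper learns this gate; gating on slot/candidate agreement
        # is a stand-in for illustration only
        gate = sigmoid(dot(slot, candidate))
        new_memory.append([gate * c + (1.0 - gate) * s
                           for c, s in zip(candidate, slot)])
    return new_memory
```

Because the updated memory, rather than the full sentence history, is what conditions the next sentence, the recurrence keeps the per-step context compact while still passing coreference-relevant information forward.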

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Video Captioning | ActivityNet Captions | BLEU-4 | 10.33 | MART (ae-test split) - Appearance + Flow |
| Video Captioning | ActivityNet Captions | CIDEr | 23.42 | MART (ae-test split) - Appearance + Flow |
| Video Captioning | ActivityNet Captions | METEOR | 15.68 | MART (ae-test split) - Appearance + Flow |
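The BLEU-4 score reported above measures clipped 1- to 4-gram overlap between generated and reference captions, scaled by a brevity penalty. As a rough illustration of what the metric computes, here is a single-pair, single-reference sketch in pure Python; the benchmark numbers come from corpus-level scoring with multiple references, so this is not the evaluation script, just the shape of the formula.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU against one reference.

    Geometric mean of clipped n-gram precisions (n = 1..max_n),
    times a brevity penalty for short candidates.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())      # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # brevity penalty: penalize candidates shorter than the reference
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference scores 1.0; paragraph captioning scores are much lower because long multi-sentence outputs rarely reproduce reference 4-grams exactly.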

Related Papers

- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
- Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization (2025-06-25)
- Dense Video Captioning using Graph-based Sentence Summarization (2025-06-25)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025-06-18)
- VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks (2025-06-10)
- ARGUS: Hallucination and Omission Evaluation in Video-LLMs (2025-06-09)
- Temporal Object Captioning for Street Scene Videos from LiDAR Tracks (2025-05-22)
- FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks (2025-05-19)