Diverse and Coherent Paragraph Generation from Images

Moitreya Chatterjee, Alexander G. Schwing

2018-09-03, ECCV 2018
Tasks: Image Paragraph Captioning, Video Summarization, Image Captioning

Abstract

Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they aren't designed to generate long informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized from a classical image captioning system, doesn't embrace the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address those challenges, we propose to augment paragraph generation techniques with 'coherence vectors', 'global topic vectors', and modeling of the inherent ambiguity of associating paragraphs with images, via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.
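For orientation, the variational auto-encoder formulation mentioned in the abstract can be related to the standard conditional evidence lower bound. The block below is a generic sketch, not the paper's exact objective: the symbols x (image features), y (paragraph), z (latent variable), and the parameter names \theta and \phi are assumptions introduced here for illustration, and the abstract does not specify how the coherence vectors or global topic vectors enter the model.

```latex
% Generic conditional VAE lower bound (illustrative; notation assumed, not taken from the paper)
% x: image features, y: paragraph, z: latent variable capturing ambiguity
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid z, x)\big]
  \;-\;
  \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big)
```

Under this generic form, drawing different samples of z at test time for the same image x yields different paragraphs, which is one common way a latent-variable model accounts for the ambiguity of associating paragraphs with images.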

Results

Task	Dataset	Metric	Value	Model
Image Paragraph Captioning	Image Paragraph Captioning	BLEU-4	9.43	Diverse and Coherent Paragraph Generation from Images
Image Paragraph Captioning	Image Paragraph Captioning	CIDEr	20.93	Diverse and Coherent Paragraph Generation from Images
Image Paragraph Captioning	Image Paragraph Captioning	METEOR	18.62	Diverse and Coherent Paragraph Generation from Images
