A Hierarchical Approach for Generating Descriptive Image Paragraphs

Jonathan Krause, Justin Johnson, Ranjay Krishna, Li Fei-Fei

2016-11-20 | CVPR 2017 | Image Paragraph Captioning, Descriptive Image Captioning, Dense Captioning

Abstract

Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach.
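The hierarchical decomposition described in the abstract can be illustrated with a toy sketch: a sentence-level recurrent step decides how many sentences to emit and produces one topic vector per sentence, and a word-level recurrent step expands each topic into words. Everything below is an assumption made for illustration and is not taken from the abstract: the max-pooling over region features, the plain tanh cell, the sigmoid stop rule, the random untrained weights, and all names (`HierarchicalDecoder`, `pool_regions`, `rnn_step`). It shows only the control flow of a two-level decoder, not the authors' model.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; untrained, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(w_in, w_rec, x, h):
    """One step of a plain tanh RNN cell (an assumed stand-in cell)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]


def pool_regions(region_features):
    """Collapse per-region feature vectors into one vector (assumed: elementwise max)."""
    return [max(column) for column in zip(*region_features)]


class HierarchicalDecoder:
    """Toy two-level decoder: sentence-level RNN on top, word-level RNN below."""

    def __init__(self, feat_dim, hidden_dim, vocab, max_sentences=3, max_words=5, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hidden_dim = hidden_dim
        self.max_sentences = max_sentences
        self.max_words = max_words
        # Sentence-level parameters.
        self.sent_in = make_matrix(hidden_dim, feat_dim, rng)
        self.sent_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.stop_w = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        # Word-level parameters.
        self.word_in = make_matrix(hidden_dim, hidden_dim, rng)
        self.word_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.word_out = make_matrix(len(vocab), hidden_dim, rng)

    def generate(self, region_features):
        """Return a paragraph as a list of sentences, each a list of words."""
        pooled = pool_regions(region_features)
        h_sent = [0.0] * self.hidden_dim
        paragraph = []
        for _ in range(self.max_sentences):
            # Sentence level: update state; the state doubles as the topic vector.
            h_sent = rnn_step(self.sent_in, self.sent_rec, pooled, h_sent)
            # Word level: fresh state per sentence, conditioned on the topic.
            h_word = [0.0] * self.hidden_dim
            words = []
            for _ in range(self.max_words):
                h_word = rnn_step(self.word_in, self.word_rec, h_sent, h_word)
                scores = matvec(self.word_out, h_word)
                words.append(self.vocab[scores.index(max(scores))])  # greedy pick
            paragraph.append(words)
            # Assumed stop rule: sigmoid of a linear readout of the sentence state.
            stop_logit = sum(w * h for w, h in zip(self.stop_w, h_sent))
            if 1.0 / (1.0 + math.exp(-stop_logit)) > 0.5:
                break
        return paragraph


if __name__ == "__main__":
    decoder = HierarchicalDecoder(4, 6, ["a", "dog", "runs", "on", "grass", "."])
    regions = [[0.1, 0.9, 0.3, 0.0], [0.7, 0.2, 0.5, 0.4]]  # two made-up regions
    for sentence in decoder.generate(regions):
        print(" ".join(sentence))
```

Because the weights are random, the printed words are meaningless; the point is that paragraph length is decided at the sentence level while word choice is decided separately within each sentence.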

Results

Task	Dataset	Metric	Value	Model
Image Paragraph Captioning	Image Paragraph Captioning	BLEU-4	8.69	Regions-Hierarchical (ours)
Image Paragraph Captioning	Image Paragraph Captioning	CIDEr	13.52	Regions-Hierarchical (ours)
Image Paragraph Captioning	Image Paragraph Captioning	METEOR	15.95	Regions-Hierarchical (ours)
