A Hierarchical Approach for Generating Descriptive Image Paragraphs

Jonathan Krause, Justin Johnson, Ranjay Krishna, Li Fei-Fei

2016-11-20 | CVPR 2017 7 | Image Paragraph Captioning, Descriptive, Image Captioning, Dense Captioning

Abstract

Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach.
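The hierarchical decoding idea in the abstract can be illustrated with a minimal control-flow sketch. It is not the authors' implementation: the names `pool_regions`, `generate_paragraph`, `sentence_step`, and `word_step` are hypothetical, the element-wise max pooling and the stop signal are assumptions about how region features and paragraph length might be handled, and the two "step" callables stand in for trained recurrent networks.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def pool_regions(region_features: Sequence[Vector]) -> Vector:
    """Collapse per-region feature vectors into one image vector.

    Assumption: element-wise max pooling; other pooling choices are possible.
    """
    return [max(column) for column in zip(*region_features)]


def generate_paragraph(
    region_features: Sequence[Vector],
    sentence_step: Callable[[Vector, Optional[object]], Tuple[object, Vector, bool]],
    word_step: Callable[[Vector, Optional[object], Optional[str]], Tuple[object, str]],
    max_sentences: int = 6,   # illustrative cap, not taken from this page
    max_words: int = 50,      # illustrative cap, not taken from this page
    end_token: str = "<eos>",
) -> List[str]:
    """Two-level decoding: an outer sentence loop and an inner word loop."""
    pooled = pool_regions(region_features)
    sentence_state = None
    paragraph: List[str] = []
    for _ in range(max_sentences):
        # Sentence level: produce a topic vector and a stop decision.
        sentence_state, topic, should_stop = sentence_step(pooled, sentence_state)
        if should_stop:
            break
        # Word level: decode one sentence conditioned on the topic vector.
        words: List[str] = []
        word_state, previous = None, None
        for _ in range(max_words):
            word_state, word = word_step(topic, word_state, previous)
            if word == end_token:
                break
            words.append(word)
            previous = word
        paragraph.append(" ".join(words))
    return paragraph


# Toy stand-ins for trained networks, used only to exercise the loops.
def toy_sentence_step(pooled, state):
    count = 0 if state is None else state
    topic = [value + count for value in pooled]
    return count + 1, topic, count >= 2  # stop once two sentences are out


def toy_word_step(topic, state, previous_word):
    position = 0 if state is None else state
    vocabulary = ["a", "region", "appears", "<eos>"]
    return position + 1, vocabulary[min(position, 3)]


if __name__ == "__main__":
    regions = [[0.1, 0.9], [0.5, 0.2]]
    print(generate_paragraph(regions, toy_sentence_step, toy_word_step))
```

The point of the sketch is the separation of concerns: the outer loop decides how many sentences to write and what each is about, while the inner loop only has to produce the words of one sentence.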

Results

Task | Dataset | Metric | Value | Model
Image Paragraph Captioning | Image Paragraph Captioning | BLEU-4 | 8.69 | Regions-Hierarchical (ours)
Image Paragraph Captioning | Image Paragraph Captioning | CIDEr | 13.52 | Regions-Hierarchical (ours)
Image Paragraph Captioning | Image Paragraph Captioning | METEOR | 15.95 | Regions-Hierarchical (ours)
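
For readers unfamiliar with the BLEU-4 metric reported above, the following is a minimal sketch of the standard definition: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty and scaled to 0-100. It assumes a single reference, whitespace tokenization, and no smoothing; the function names `ngram_counts` and `bleu4` are hypothetical, and the page does not say which evaluation toolkit produced the reported values, so this sketch is not expected to reproduce them.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: str, reference: str) -> float:
    """Single-reference, unsmoothed BLEU-4 on a 0-100 scale."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precision_sum += math.log(clipped / total) / 4
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return 100.0 * brevity_penalty * math.exp(log_precision_sum)


if __name__ == "__main__":
    print(bleu4("the cat sat on the mat", "the cat sat on a mat"))
```

Published scores are normally computed at corpus level, often against multiple references, so a per-sentence function like this one is useful for intuition rather than for comparison with the table.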
