Licheng Yu, Mohit Bansal, Tamara L. Berg
We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos and then composes a natural language story for the album. For this task, we use the Visual Storytelling (VIST) dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) that encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model outperforms baselines on selection, generation, and retrieval.
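The three-stage pipeline described above (encode photos, select summary photos, compose the story) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the vanilla tanh RNN, the random weights, and the dot-product attention selector are all assumptions for demonstration; the actual model uses learned, jointly trained hierarchically-attentive RNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(xs, W_x, W_h, b):
    """Vanilla tanh RNN over a sequence; returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 8 album photos, 16-d CNN features, 12-d hidden state,
# 3 summary photos to select.
n_photos, feat_dim, hid_dim, k = 8, 16, 12, 3
photos = rng.normal(size=(n_photos, feat_dim))  # stand-in for CNN features

# Stage 1: album-encoder RNN contextualizes each photo within the album.
W_x = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
W_h = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)
H = rnn_encode(photos, W_x, W_h, b)  # (n_photos, hid_dim)

# Stage 2: attention scores over photos; pick top-k as the summary,
# restoring album order so the story stays chronological.
v = rng.normal(scale=0.1, size=hid_dim)  # hypothetical attention vector
scores = softmax(H @ v)
summary_idx = np.sort(np.argsort(-scores)[:k])
summary_states = H[summary_idx]  # (k, hid_dim)

# Stage 3: a story-composer RNN would decode one sentence per selected
# photo, attending over summary_states (decoder omitted in this sketch).
```

In the real model the selector is trained with a ranking objective (the "rank" in h-attn-rank) rather than a fixed top-k cut, and the composer is a full sentence-level decoder.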
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Storytelling | VIST | BLEU-3 | 20.78 | h-attn-rank |
| Visual Storytelling | VIST | CIDEr | 7.38 | h-attn-rank |
| Visual Storytelling | VIST | METEOR | 33.94 | h-attn-rank |
| Visual Storytelling | VIST | ROUGE-L | 29.82 | h-attn-rank |