Licheng Yu, Mohit Bansal, Tamara L. Berg
We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos and then composes a natural language story for the album. For this task, we use the Visual Storytelling (VIST) dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) that encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model outperforms baselines on selection, generation, and retrieval.
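The three-stage pipeline described above (encode photos, select summary photos, compose the story) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the vanilla tanh RNN, the random weights, and the dot-product attention selector are all assumptions for demonstration; the actual model uses learned, jointly trained hierarchically-attentive RNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(xs, W_x, W_h, b):
    """Vanilla tanh RNN over a sequence; returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 8 album photos, 16-d CNN features, 12-d hidden state,
# 3 summary photos to select.
n_photos, feat_dim, hid_dim, k = 8, 16, 12, 3
photos = rng.normal(size=(n_photos, feat_dim))  # stand-in for CNN features

# Stage 1: album-encoder RNN contextualizes each photo within the album.
W_x = rng.normal(scale=0.1, size=(hid_dim, feat_dim))
W_h = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)
H = rnn_encode(photos, W_x, W_h, b)  # (n_photos, hid_dim)

# Stage 2: attention scores over photos; pick top-k as the summary,
# restoring album order so the story stays chronological.
v = rng.normal(scale=0.1, size=hid_dim)  # hypothetical attention vector
scores = softmax(H @ v)
summary_idx = np.sort(np.argsort(-scores)[:k])
summary_states = H[summary_idx]  # (k, hid_dim)

# Stage 3: a story-composer RNN would decode one sentence per selected
# photo, attending over summary_states (decoder omitted in this sketch).
```

In the real model the selector is trained with a ranking objective (the "rank" in h-attn-rank) rather than a fixed top-k cut, and the composer is a full sentence-level decoder.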
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Storytelling | VIST | BLEU-3 | 20.78 | h-attn-rank |
| Visual Storytelling | VIST | CIDEr | 7.38 | h-attn-rank |
| Visual Storytelling | VIST | METEOR | 33.94 | h-attn-rank |
| Visual Storytelling | VIST | ROUGE-L | 29.82 | h-attn-rank |