Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling

Yunjae Jung, Dahun Kim, Sanghyun Woo, Kyung-Su Kim, Sungjin Kim, In So Kweon

2020-02-03 · Image Captioning · Visual Storytelling

Abstract

Visual storytelling is the task of creating a short story based on photo streams. Unlike existing visual captioning, storytelling aims to contain not only factual descriptions but also human-like narration and semantics. However, the VIST dataset consists only of a small, fixed number of photos per story. Therefore, the main challenge of visual storytelling is to fill in the visual gap between photos with a narrative and imaginative story. In this paper, we propose to explicitly learn to imagine a storyline that bridges the visual gap. During training, one or more photos are randomly omitted from the input stack, and we train the network to produce a full, plausible story even with the missing photo(s). Furthermore, we propose a hide-and-tell model for visual storytelling, which is designed to learn non-local relations across the photo streams and to refine and improve conventional RNN-based models. In experiments, we show that our hide-and-tell scheme and network design are indeed effective at storytelling, and that our model outperforms previous state-of-the-art methods on automatic metrics. Finally, we qualitatively show the learned ability to interpolate a storyline over visual gaps.
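The "hide" step of the training scheme described above (randomly omitting one or more photos from the input stack so the model must narrate across the gap) can be sketched as follows. This is a minimal illustration of the masking idea only; the function and parameter names are assumptions, not the authors' implementation.

```python
import random

def hide_photos(photo_stream, max_hidden=1):
    """Randomly omit between 1 and `max_hidden` photos from the
    input stack, always keeping at least one photo. Returns the
    remaining photos and the indices that were hidden, so the
    model can still be supervised on the full five-sentence story.
    Names here are illustrative, not the paper's code.
    """
    n = len(photo_stream)
    k = random.randint(1, min(max_hidden, n - 1))
    hidden = set(random.sample(range(n), k))
    kept = [p for i, p in enumerate(photo_stream) if i not in hidden]
    return kept, sorted(hidden)
```

During training the loss would still be computed against the complete ground-truth story, which is what forces the network to imagine the narrative for the hidden photos.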

Results

Task                     Dataset  Metric   Value  Model
Text Generation          VIST     BLEU-1   64.4   INet
Text Generation          VIST     BLEU-2   0.401  INet
Text Generation          VIST     BLEU-3   23.9   INet
Text Generation          VIST     BLEU-4   14.7   INet
Text Generation          VIST     CIDEr    10     INet
Text Generation          VIST     METEOR   35.6   INet
Text Generation          VIST     ROUGE-L  29.7   INet
Data-to-Text Generation  VIST     BLEU-1   64.4   INet
Data-to-Text Generation  VIST     BLEU-2   0.401  INet
Data-to-Text Generation  VIST     BLEU-3   23.9   INet
Data-to-Text Generation  VIST     BLEU-4   14.7   INet
Data-to-Text Generation  VIST     CIDEr    10     INet
Data-to-Text Generation  VIST     METEOR   35.6   INet
Data-to-Text Generation  VIST     ROUGE-L  29.7   INet
Visual Storytelling      VIST     BLEU-1   64.4   INet
Visual Storytelling      VIST     BLEU-2   0.401  INet
Visual Storytelling      VIST     BLEU-3   23.9   INet
Visual Storytelling      VIST     BLEU-4   14.7   INet
Visual Storytelling      VIST     CIDEr    10     INet
Visual Storytelling      VIST     METEOR   35.6   INet
Visual Storytelling      VIST     ROUGE-L  29.7   INet
Story Generation         VIST     BLEU-1   64.4   INet
Story Generation         VIST     BLEU-2   0.401  INet
Story Generation         VIST     BLEU-3   23.9   INet
Story Generation         VIST     BLEU-4   14.7   INet
Story Generation         VIST     CIDEr    10     INet
Story Generation         VIST     METEOR   35.6   INet
Story Generation         VIST     ROUGE-L  29.7   INet

Related Papers

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
Shape2Animal: Creative Animal Generation from Natural Silhouettes (2025-06-25)
JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent (2025-06-21)
HalLoc: Token-level Localization of Hallucinations for Vision Language Models (2025-06-12)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs (2025-06-11)
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math Reasoning (2025-06-11)