Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Informative Visual Storytelling with Cross-modal Rules

Jiacheng Li, Haizhou Shi, Siliang Tang, Fei Wu, Yueting Zhuang

2019-07-07 · Story Generation · Visual Storytelling
Paper · PDF · Code (official)

Abstract

Existing methods in the Visual Storytelling field often generate general descriptions, while many meaningful contents of the images remain unnoticed. This failure to generate informative stories can be attributed to the model's inability to capture enough meaningful concepts. These concepts include entities, attributes, actions, and events, which are in some cases crucial to grounded storytelling. To solve this problem, we propose a method that mines cross-modal rules to help the model infer such informative concepts given certain visual input. We first build multimodal transactions by concatenating the CNN activations and the word indices. We then apply an association rule mining algorithm to mine the cross-modal rules, which are used for concept inference. With the help of these rules, the generated stories are more grounded and informative. In addition, the proposed method offers interpretability, expandability, and transferability, indicating potential for wider application. Finally, we leverage the inferred concepts in an encoder-decoder framework with an attention mechanism. Experiments on the VIsual StoryTelling (VIST) dataset demonstrate the effectiveness of our approach in terms of both automatic metrics and human evaluation. Additional experiments show that the mined cross-modal rules, used as additional knowledge, help the model achieve better performance when trained on a small dataset.
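The core pipeline in the abstract (build multimodal transactions from thresholded CNN activations and story word indices, then mine association rules whose antecedents are visual items and whose consequents are words) can be sketched in a few lines. This is a minimal toy illustration under assumed conventions, not the paper's implementation: the sample transactions, the "v:"/"w:" item prefixes, and the support/confidence thresholds are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy transactions: each mixes visual items ("v:*", e.g. from
# thresholded CNN activations) with word items ("w:*") from the paired story.
transactions = [
    {"v:dog", "v:grass", "w:dog", "w:park"},
    {"v:dog", "v:ball", "w:dog", "w:play"},
    {"v:dog", "v:grass", "w:dog", "w:run"},
    {"v:cake", "v:candle", "w:birthday", "w:cake"},
]

def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Mine cross-modal rules of the form: visual item -> word item.

    support(v, w)    = count(v and w together) / number of transactions
    confidence(v->w) = count(v and w together) / count(v)
    """
    n = len(transactions)
    counts = defaultdict(int)
    for t in transactions:
        visual = [i for i in t if i.startswith("v:")]
        words = [i for i in t if i.startswith("w:")]
        for v in visual:
            counts[(v,)] += 1
            for w in words:
                counts[(v, w)] += 1
    rules = []
    for key, c in counts.items():
        if len(key) == 2 and c / n >= min_support:
            confidence = c / counts[(key[0],)]
            if confidence >= min_confidence:
                rules.append((key[0], key[1], confidence))
    return rules

print(mine_rules(transactions))
```

At inference time, a rule such as ("v:dog" -> "w:dog") would let the model surface the concept "dog" whenever the corresponding visual item fires, even if the decoder would otherwise fall back on a generic description.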

Results

Task                    | Dataset | Metric  | Value | Model
------------------------|---------|---------|-------|------
Text Generation         | VIST    | BLEU-1  | 63.8  | VSCMR
Text Generation         | VIST    | BLEU-4  | 14.3  | VSCMR
Text Generation         | VIST    | CIDEr   | 9     | VSCMR
Text Generation         | VIST    | METEOR  | 35.5  | VSCMR
Text Generation         | VIST    | ROUGE-L | 30.2  | VSCMR
Data-to-Text Generation | VIST    | BLEU-1  | 63.8  | VSCMR
Data-to-Text Generation | VIST    | BLEU-4  | 14.3  | VSCMR
Data-to-Text Generation | VIST    | CIDEr   | 9     | VSCMR
Data-to-Text Generation | VIST    | METEOR  | 35.5  | VSCMR
Data-to-Text Generation | VIST    | ROUGE-L | 30.2  | VSCMR
Visual Storytelling     | VIST    | BLEU-1  | 63.8  | VSCMR
Visual Storytelling     | VIST    | BLEU-4  | 14.3  | VSCMR
Visual Storytelling     | VIST    | CIDEr   | 9     | VSCMR
Visual Storytelling     | VIST    | METEOR  | 35.5  | VSCMR
Visual Storytelling     | VIST    | ROUGE-L | 30.2  | VSCMR
Story Generation        | VIST    | BLEU-1  | 63.8  | VSCMR
Story Generation        | VIST    | BLEU-4  | 14.3  | VSCMR
Story Generation        | VIST    | CIDEr   | 9     | VSCMR
Story Generation        | VIST    | METEOR  | 35.5  | VSCMR
Story Generation        | VIST    | ROUGE-L | 30.2  | VSCMR

Related Papers

- Compressed and Smooth Latent Space for Text Diffusion Modeling (2025-06-26)
- Shape2Animal: Creative Animal Generation from Natural Silhouettes (2025-06-25)
- JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent (2025-06-21)
- StoryWriter: A Multi-Agent Framework for Long Story Generation (2025-06-19)
- VINCIE: Unlocking In-context Image Editing from Video (2025-06-12)
- Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective (2025-06-11)
- Consistent Story Generation with Asymmetry Zigzag Sampling (2025-06-11)
- Counterfactual reasoning: an analysis of in-context emergence (2025-06-05)