ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

Sixiao Zheng, Yanwei Fu

2024-07-13

Story Visualization, Text-to-Image Generation, Image Generation, Story Continuation, Visual Storytelling

Abstract

Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed to generate coherent story frames and extend frames for visual storytelling. ContextualStory utilizes Spatially-Enhanced Temporal Attention to capture spatial and temporal dependencies, handling significant character movements effectively. Additionally, we introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames for guiding the model. Extensive experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation. Code is available at https://github.com/sixiaozheng/ContextualStory.

Results

Task	Dataset	Metric	Value	Model
Text-To-Image	Pororo	FID	14.07	ContextualStory
Story Continuation	PororoSV	FID	14.2	ContextualStory
Story Continuation	FlintstonesSV	FID	16.33	ContextualStory
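
The table reports FID (Fréchet Inception Distance), where lower values indicate generated frames that are statistically closer to the real ones. For reference, the standard definition of the metric is given below. The symbols are notation introduced here, not taken from the listing: \mu_r, \Sigma_r are the mean and covariance of Inception features of real images, and \mu_g, \Sigma_g are the same statistics for generated images. The listing does not state which feature extractor or implementation was used, so this should be read as the general formula rather than the authors' exact evaluation setup.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term measures the shift between the two feature means and the trace term measures the mismatch between their covariances, so FID is zero only when both statistics coincide.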

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)
CATVis: Context-Aware Thought Visualization (2025-07-15)