Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

Adyasha Maharana, Darryl Hannan, Mohit Bansal

Published: 2022-09-13
Tasks: Story Visualization, Video Captioning, Image Generation, Story Continuation
Links: Paper · PDF · Code (official)

Abstract

Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. We then enhance, or 'retro-fit', the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. Next, we explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach, StoryDALL-E, on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset, DiDeMoSV, collected from a video-captioning dataset. We also develop a model, StoryGANc, based on Generative Adversarial Networks (GANs) for story continuation, and compare it with StoryDALL-E to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image, thereby improving continuity in the generated visual story. Finally, our analysis suggests that pretrained transformers struggle to comprehend narratives containing several characters. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
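The "retro-fitting" described in the abstract adds task-specific modules in which caption tokens attend to tokens of the source frame, so visual elements can be copied into the generated story. The function and variable names below are illustrative assumptions, not the authors' implementation; a minimal single-head cross-attention sketch in numpy:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, source_tokens, Wq, Wk, Wv):
    """One cross-attention head: caption tokens (queries) attend to
    source-frame tokens (keys/values), pulling source content into
    the text stream.

    text_tokens:   (T, d_in) caption token embeddings
    source_tokens: (S, d_in) source-frame token embeddings
    Wq, Wk, Wv:    (d_in, d) projection matrices
    """
    Q = text_tokens @ Wq                      # (T, d)
    K = source_tokens @ Wk                    # (S, d)
    V = source_tokens @ Wv                    # (S, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, S) scaled dot products
    weights = softmax(scores, axis=-1)        # each query's distribution over source tokens
    return weights @ V                        # (T, d) source-conditioned features

# Toy usage with random weights (shapes are arbitrary choices)
rng = np.random.default_rng(0)
text = rng.normal(size=(3, 5))
src = rng.normal(size=(4, 5))
Wq, Wk, Wv = (rng.normal(size=(5, 6)) for _ in range(3))
out = cross_attention(text, src, Wq, Wk, Wv)  # shape (3, 6)
```

In the paper's setting this layer sits inside a frozen pretrained transformer, with only the new parameters (here Wq, Wk, Wv) trained, which is what makes the adaptation parameter-efficient.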

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Story Continuation | PororoSV | Char-F1 | 40.28 | StoryDALL-E
Story Continuation | PororoSV | F-Acc | 20.94 | StoryDALL-E
Story Continuation | PororoSV | FID | 21.64 | StoryDALL-E
Story Continuation | PororoSV | Char-F1 | 40.25 | StoryDALL-E (Cross-Attention)
Story Continuation | PororoSV | F-Acc | 18.16 | StoryDALL-E (Cross-Attention)
Story Continuation | PororoSV | FID | 23.27 | StoryDALL-E (Cross-Attention)
Story Continuation | PororoSV | Char-F1 | 39.32 | StoryDALL-E (Story Embeddings)
Story Continuation | PororoSV | F-Acc | 34.65 | StoryDALL-E (Story Embeddings)
Story Continuation | PororoSV | FID | 30.45 | StoryDALL-E (Story Embeddings)
Story Continuation | PororoSV | Char-F1 | 35.29 | StoryDALL-E (Story Embeddings + Cross-Attention)
Story Continuation | PororoSV | F-Acc | 16.73 | StoryDALL-E (Story Embeddings + Cross-Attention)
Story Continuation | PororoSV | FID | 31.68 | StoryDALL-E (Story Embeddings + Cross-Attention)
Story Continuation | FlintstonesSV | Char-F1 | 74.28 | StoryDALL-E
Story Continuation | FlintstonesSV | F-Acc | 52.35 | StoryDALL-E
Story Continuation | FlintstonesSV | FID | 28.37 | StoryDALL-E
Story Continuation | FlintstonesSV | Char-F1 | 72.18 | StoryDALL-E (Story Embeddings)
Story Continuation | FlintstonesSV | F-Acc | 53.28 | StoryDALL-E (Story Embeddings)
Story Continuation | FlintstonesSV | FID | 29.21 | StoryDALL-E (Story Embeddings)
Story Continuation | FlintstonesSV | Char-F1 | 73.94 | StoryDALL-E (Cross-Attention)
Story Continuation | FlintstonesSV | F-Acc | 52.72 | StoryDALL-E (Cross-Attention)
Story Continuation | FlintstonesSV | FID | 35.04 | StoryDALL-E (Cross-Attention)
Story Continuation | FlintstonesSV | Char-F1 | 72.44 | StoryDALL-E (Story Embeddings + Cross-Attention)
Story Continuation | FlintstonesSV | F-Acc | 51.32 | StoryDALL-E (Story Embeddings + Cross-Attention)
Story Continuation | FlintstonesSV | FID | 36.28 | StoryDALL-E (Story Embeddings + Cross-Attention)

(Char-F1 and F-Acc: higher is better. FID: lower is better.)
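Char-F1 (character F1) and F-Acc (frame accuracy) above are computed from a trained character classifier's predictions on the generated frames; the exact evaluation pipeline is in the paper's code. Purely as an illustration of how such metrics behave, here is a sketch assuming predictions and references are per-frame sets of character names, computing a micro-averaged F1 and exact-match frame accuracy:

```python
def char_f1(pred_sets, gold_sets):
    """Micro-averaged F1 over per-frame predicted vs. reference character sets."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))  # correctly predicted characters
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))  # spurious characters
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))  # missed characters
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def frame_accuracy(pred_sets, gold_sets):
    """Fraction of frames whose character set matches the reference exactly."""
    return sum(p == g for p, g in zip(pred_sets, gold_sets)) / len(gold_sets)

# Toy example (character names are illustrative)
pred = [{"Pororo"}, {"Loopy", "Eddy"}]  # predicted characters per frame
gold = [{"Pororo"}, {"Loopy"}]          # reference characters per frame
f1 = char_f1(pred, gold)                # ~0.8: one spurious character hurts precision
acc = frame_accuracy(pred, gold)        # 0.5: 1 of 2 frames matched exactly
```

Frame accuracy is the stricter metric, which is consistent with F-Acc values in the table being well below the Char-F1 values.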

Related Papers

fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)
Synthesizing Reality: Leveraging the Generative AI-Powered Platform Midjourney for Construction Worker Detection (2025-07-17)
FashionPose: Text to Pose to Relight Image Generation for Personalized Fashion Visualization (2025-07-17)
A Distributed Generative AI Approach for Heterogeneous Multi-Domain Environments under Data Sharing constraints (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
FADE: Adversarial Concept Erasure in Flow Models (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)
CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)