Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Vision Transformer Based Model for Describing a Set of Images as a Story

Zainy M. Malakan, Ghulam Mubashar Hassan, Ajmal Mian

2022-10-06 · Visual Storytelling · Language Modelling

Abstract

Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of the most challenging aspects of visual storytelling. Consequently, stories generated from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, input images are divided into 16×16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are fed into a Bidirectional LSTM, which forms part of the sequence encoder and captures the past and future context of all image patches. An attention mechanism is then applied to increase the discriminatory capacity of the data fed into the language model, a Mogrifier-LSTM. The performance of the proposed model is evaluated on the Visual Storytelling dataset (VIST), and the results show that our model outperforms the current state-of-the-art models.
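The patch-tokenisation step the abstract describes (dividing each image into 16×16 patches and bundling them into a linear projection of flattened patches) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the embedding dimension and the random stand-in for the learned projection matrix are assumptions for demonstration.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into non-overlapping patch_size x patch_size
    patches, flatten each patch, and apply a linear projection -- the ViT
    tokenisation step described in the abstract. The projection matrix is
    learned in practice; here a fixed random matrix stands in for it."""
    rng = rng or np.random.default_rng(0)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    n_h, n_w = h // patch_size, w // patch_size
    # Rearrange (H, W, C) -> (num_patches, patch_size * patch_size * C)
    patches = image.reshape(n_h, patch_size, n_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, -1)
    # Linear projection of the flattened patches (hypothetical weights).
    w_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    return patches @ w_proj

# A 224x224 RGB image yields 14 x 14 = 196 patch tokens.
img = np.random.default_rng(1).standard_normal((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # (196, 768)
```

In the paper's pipeline, these per-patch tokens would then be consumed by the Bidirectional LSTM encoder and, after attention, by the Mogrifier-LSTM language model.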

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Text Generation | VIST | BLEU-1 | 63 | ViT-model |
| Text Generation | VIST | BLEU-2 | 37.5 | ViT-model |
| Text Generation | VIST | BLEU-3 | 21.5 | ViT-model |
| Text Generation | VIST | BLEU-4 | 12.3 | ViT-model |
| Text Generation | VIST | CIDEr | 4.4 | ViT-model |
| Text Generation | VIST | METEOR | 35.4 | ViT-model |
| Text Generation | VIST | ROUGE-L | 31 | ViT-model |
| Data-to-Text Generation | VIST | BLEU-1 | 63 | ViT-model |
| Data-to-Text Generation | VIST | BLEU-2 | 37.5 | ViT-model |
| Data-to-Text Generation | VIST | BLEU-3 | 21.5 | ViT-model |
| Data-to-Text Generation | VIST | BLEU-4 | 12.3 | ViT-model |
| Data-to-Text Generation | VIST | CIDEr | 4.4 | ViT-model |
| Data-to-Text Generation | VIST | METEOR | 35.4 | ViT-model |
| Data-to-Text Generation | VIST | ROUGE-L | 31 | ViT-model |
| Visual Storytelling | VIST | BLEU-1 | 63 | ViT-model |
| Visual Storytelling | VIST | BLEU-2 | 37.5 | ViT-model |
| Visual Storytelling | VIST | BLEU-3 | 21.5 | ViT-model |
| Visual Storytelling | VIST | BLEU-4 | 12.3 | ViT-model |
| Visual Storytelling | VIST | CIDEr | 4.4 | ViT-model |
| Visual Storytelling | VIST | METEOR | 35.4 | ViT-model |
| Visual Storytelling | VIST | ROUGE-L | 31 | ViT-model |
| Story Generation | VIST | BLEU-1 | 63 | ViT-model |
| Story Generation | VIST | BLEU-2 | 37.5 | ViT-model |
| Story Generation | VIST | BLEU-3 | 21.5 | ViT-model |
| Story Generation | VIST | BLEU-4 | 12.3 | ViT-model |
| Story Generation | VIST | CIDEr | 4.4 | ViT-model |
| Story Generation | VIST | METEOR | 35.4 | ViT-model |
| Story Generation | VIST | ROUGE-L | 31 | ViT-model |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)