Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Soyeon Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

2020-10-07 | Text-to-Image Generation, Dependency Parsing

Abstract

Text-to-image multimodal tasks, which generate or retrieve an image from a given text description, are extremely challenging because raw text descriptions carry too little information to fully describe visually realistic images. We propose VICTR, a new visual contextual text representation for text-to-image multimodal tasks that captures rich visual semantic information about objects from the text input. First, we take the text description as the initial input and apply dependency parsing to extract its syntactic structure and analyse its semantic aspects, including object quantities, in order to build a scene graph. Then, we train the objects, attributes, and relations extracted into the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks, which produces a text representation that integrates textual and visual semantic information. This text representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations. For evaluation, we attached VICTR to state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves results in both quantitative and qualitative terms.
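The pipeline described in the abstract (scene graph, graph convolution, aggregation with text embeddings) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper or its official code: the scene graph is written by hand instead of being produced by dependency parsing, relations are reduced to untyped edges, the propagation rule is the commonly used symmetrically normalized form, the features and weights are random, and aggregation is shown as plain concatenation followed by mean pooling.

```python
import numpy as np


def build_adjacency(nodes, triples):
    """Build a symmetric adjacency matrix with self-loops for a small scene graph."""
    index = {name: i for i, name in enumerate(nodes)}
    adj = np.eye(len(nodes))  # self-loops keep each node's own features
    for subj, _relation, obj in triples:
        # Simplification: the relation label is ignored, edges are untyped.
        adj[index[subj], index[obj]] = 1.0
        adj[index[obj], index[subj]] = 1.0
    return adj


def gcn_layer(adj, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 A D^-1/2 X W)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)


# Hypothetical hand-written scene graph for "a brown dog on a couch".
nodes = ["dog", "brown", "couch"]
triples = [("dog", "has_attribute", "brown"), ("dog", "on", "couch")]

rng = np.random.default_rng(0)
features = rng.normal(size=(len(nodes), 8))  # stand-in node features
weight = rng.normal(size=(8, 4))             # stand-in layer weights

adj = build_adjacency(nodes, triples)
graph_repr = gcn_layer(adj, features, weight)  # one 4-dim vector per node

# Assumed aggregation: concatenate stand-in word embeddings with the graph
# output, then mean-pool the rows to get a sentence-level vector.
word_emb = rng.normal(size=(len(nodes), 6))
visual_word_repr = np.concatenate([word_emb, graph_repr], axis=1)
sentence_repr = visual_word_repr.mean(axis=0)
```

In this sketch, visual_word_repr plays the role of a visual contextual word representation and sentence_repr the role of a sentence representation; a trained model would learn the weights and would likely treat relation types and geometric information explicitly.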

Results

Task | Dataset | Metric | Value | Model
Image Generation | COCO (Common Objects in Context) | FID | 29.26 | AttnGAN + VICTR
Image Generation | COCO (Common Objects in Context) | Inception score | 28.18 | AttnGAN + VICTR
Image Generation | COCO (Common Objects in Context) | FID | 32.37 | DM-GAN + VICTR
Image Generation | COCO (Common Objects in Context) | Inception score | 32.37 | DM-GAN + VICTR
Image Generation | COCO (Common Objects in Context) | Inception score | 10.38 | StackGAN + VICTR
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 29.26 | AttnGAN + VICTR
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 28.18 | AttnGAN + VICTR
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 32.37 | DM-GAN + VICTR
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 32.37 | DM-GAN + VICTR
Text-to-Image Generation | COCO (Common Objects in Context) | Inception score | 10.38 | StackGAN + VICTR
10-shot image generation | COCO (Common Objects in Context) | FID | 29.26 | AttnGAN + VICTR
10-shot image generation | COCO (Common Objects in Context) | Inception score | 28.18 | AttnGAN + VICTR
10-shot image generation | COCO (Common Objects in Context) | FID | 32.37 | DM-GAN + VICTR
10-shot image generation | COCO (Common Objects in Context) | Inception score | 32.37 | DM-GAN + VICTR
10-shot image generation | COCO (Common Objects in Context) | Inception score | 10.38 | StackGAN + VICTR
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 29.26 | AttnGAN + VICTR
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 28.18 | AttnGAN + VICTR
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 32.37 | DM-GAN + VICTR
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 32.37 | DM-GAN + VICTR
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | Inception score | 10.38 | StackGAN + VICTR

Related Papers

CharaConsist: Fine-Grained Consistent Character Generation (2025-07-15)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
NeoBabel: A Multilingual Open Tower for Visual Generation (2025-07-08)
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer (2025-07-07)
UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis (2025-07-01)
Ovis-U1 Technical Report (2025-06-29)
Rethink Sparse Signals for Pose-guided Text-to-image Generation (2025-06-26)
XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation (2025-06-26)