Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

L-Verse: Bidirectional Generation Between Image and Text

TaeHoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae

Published: 2021-11-22
Venue: CVPR 2022
Tasks: Text-to-Image Generation, Text Generation, Representation Learning, Image to text, Image Reconstruction, Image Captioning, Image Generation, Object Detection

Abstract

Far beyond learning long-range interactions of natural language, transformers are becoming the de-facto standard for many vision tasks with their power and scalability. Especially with cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to make a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of feature-augmented variational autoencoder (AugVAE) and bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE shows the state-of-the-art reconstruction performance on ImageNet1K validation set, along with the robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and a generation target. L-Verse can be directly used for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of L-Verse architecture on Conceptual Captions and present the initial result of bidirectional vision-language representation learning on general domain.
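
The abstract notes that VQ-VAEs turn a raw RGB image into a sequence of feature vectors that a transformer can consume. The sketch below illustrates only the generic vector-quantization step, assuming an encoder has already produced a grid of feature vectors. It is not the AugVAE from the paper: the function names `quantize` and `image_to_tokens`, the codebook size, and the feature dimension are all hypothetical choices for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry.

    features: array of shape (N, D)
    codebook: array of shape (K, D)
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    return sq_dists.argmin(axis=1)


def image_to_tokens(feature_map, codebook):
    """Flatten an (H, W, D) feature grid into a length H*W sequence of code indices."""
    h, w, d = feature_map.shape
    return quantize(feature_map.reshape(h * w, d), codebook)


# Toy example with hypothetical sizes: 8 codes of dimension 4, a 2x3 feature grid.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
feature_map = rng.normal(size=(2, 3, 4))
tokens = image_to_tokens(feature_map, codebook)
print(tokens.shape)  # one discrete token per spatial position
```

Looking up `codebook[tokens]` recovers the quantized feature vectors, which is what a decoder would reconstruct the image from. A transformer operating on image and text would see `tokens` as just another integer sequence.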

Results

Task | Dataset | Metric | Value | Model
Image Generation | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC
Image Generation | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC
Image Generation | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC
Image Generation | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC
Image Generation | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC
Image Generation | COCO (Common Objects in Context) | FID | 45.8 | L-Verse
Image Generation | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse
Image Generation | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse
Image Generation | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse
Image Generation | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse
Image Captioning | COCO Captions | BLEU-4 | 39.9 | L-Verse
Image Captioning | COCO Captions | METEOR | 31.4 | L-Verse
Image Captioning | COCO Captions | ROUGE-L | 60.4 | L-Verse
Image Captioning | COCO Captions | SPICE | 23.3 | L-Verse
Image Reconstruction | ImageNet 256x256 | FID | 1.04 | AugVAE-ML
Image Reconstruction | ImageNet 256x256 | FID | 3.28 | AugVAE-SL
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC
Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC
Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC
Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC
Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC
Text-to-Image Generation | COCO (Common Objects in Context) | FID | 45.8 | L-Verse
Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse
Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse
Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse
Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse
10-shot image generation | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC
10-shot image generation | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC
10-shot image generation | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC
10-shot image generation | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC
10-shot image generation | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC
10-shot image generation | COCO (Common Objects in Context) | FID | 45.8 | L-Verse
10-shot image generation | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse
10-shot image generation | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse
10-shot image generation | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse
10-shot image generation | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID | 45.8 | L-Verse
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse
1 Image, 2*2 Stitchi | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse
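
For readers comparing the FID rows, the standard Fréchet Inception Distance is usually written as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The formula below is that general definition, not something stated on this page. The symbols are the conventional ones and are assumptions here: $\mu_r, \Sigma_r$ for the mean and covariance of real-image features, and $\mu_g, \Sigma_g$ for generated-image features. The page does not define the FID-1 through FID-8 variants, so they are not covered.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer, so in the rows above a smaller FID is better.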
