L-Verse: Bidirectional Generation Between Image and Text

TaeHoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae

2021-11-22, CVPR 2022
Tasks: Text-to-Image Generation, Text Generation, Representation Learning, Image-to-Text, Image Reconstruction, Image Captioning, Image Generation, Object Detection

Abstract

Beyond modeling long-range interactions in natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. In cross-modal tasks between image and text in particular, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set, along with robustness to unseen images in the wild. Unlike other models, BiART can distinguish between image (or text) as a conditional reference and as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning in the general domain.
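The two ideas in the abstract, quantizing an image into a token sequence and marking which modality is the conditional reference and which is the generation target, can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-neighbor codebook lookup is the generic VQ-VAE quantization step, and the names `quantize`, `build_sequence`, `REF`, and `GEN` are hypothetical, as is the assumption that the reference modality is placed first in the sequence.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize(features: List[List[Vector]], codebook: List[Vector]) -> List[int]:
    """Flatten an H x W grid of encoder features into a row-major token sequence."""
    return [nearest_code(v, codebook) for row in features for v in row]


# Hypothetical segment ids: conditional reference vs. generation target.
REF, GEN = 0, 1


def build_sequence(
    text_tokens: List[int], image_tokens: List[int], direction: str
) -> Tuple[List[int], List[int]]:
    """Concatenate both modalities and tag every position with its role.

    Assumption: the reference modality comes first. A real model would also
    keep text and image token ids in separate vocabulary ranges.
    """
    if direction == "text-to-image":
        ref, gen = text_tokens, image_tokens
    elif direction == "image-to-text":
        ref, gen = image_tokens, text_tokens
    else:
        raise ValueError(f"unknown direction: {direction}")
    tokens = list(ref) + list(gen)
    segments = [REF] * len(ref) + [GEN] * len(gen)
    return tokens, segments


# Toy example: a 2 x 2 feature grid and a three-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [[(0.1, 0.0), (0.9, 1.1)], [(0.1, 0.8), (1.0, 0.9)]]
image_tokens = quantize(features, codebook)
tokens, segments = build_sequence([7, 8], image_tokens, "image-to-text")
```

In this toy setup the same pair of token lists serves both directions; only the segment ids change, which is the kind of signal a single transformer could use to tell the reference from the target.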

Results

Task	Dataset	Metric	Value	Model
Image Generation	COCO (Common Objects in Context)	FID	37.2	L-Verse-CC
Image Generation	COCO (Common Objects in Context)	FID-1	31.6	L-Verse-CC
Image Generation	COCO (Common Objects in Context)	FID-2	25.7	L-Verse-CC
Image Generation	COCO (Common Objects in Context)	FID-4	21.4	L-Verse-CC
Image Generation	COCO (Common Objects in Context)	FID-8	21.1	L-Verse-CC
Image Generation	COCO (Common Objects in Context)	FID	45.8	L-Verse
Image Generation	COCO (Common Objects in Context)	FID-1	41.9	L-Verse
Image Generation	COCO (Common Objects in Context)	FID-2	35.5	L-Verse
Image Generation	COCO (Common Objects in Context)	FID-4	30.2	L-Verse
Image Generation	COCO (Common Objects in Context)	FID-8	29.83	L-Verse
Image Captioning	COCO Captions	BLEU-4	39.9	L-Verse
Image Captioning	COCO Captions	METEOR	31.4	L-Verse
Image Captioning	COCO Captions	ROUGE-L	60.4	L-Verse
Image Captioning	COCO Captions	SPICE	23.3	L-Verse
Image Reconstruction	ImageNet 256x256	FID	1.04	AugVAE-ML
Image Reconstruction	ImageNet 256x256	FID	3.28	AugVAE-SL
Text-to-Image Generation	COCO (Common Objects in Context)	FID	37.2	L-Verse-CC
Text-to-Image Generation	COCO (Common Objects in Context)	FID-1	31.6	L-Verse-CC
Text-to-Image Generation	COCO (Common Objects in Context)	FID-2	25.7	L-Verse-CC
Text-to-Image Generation	COCO (Common Objects in Context)	FID-4	21.4	L-Verse-CC
Text-to-Image Generation	COCO (Common Objects in Context)	FID-8	21.1	L-Verse-CC
Text-to-Image Generation	COCO (Common Objects in Context)	FID	45.8	L-Verse
Text-to-Image Generation	COCO (Common Objects in Context)	FID-1	41.9	L-Verse
Text-to-Image Generation	COCO (Common Objects in Context)	FID-2	35.5	L-Verse
Text-to-Image Generation	COCO (Common Objects in Context)	FID-4	30.2	L-Verse
Text-to-Image Generation	COCO (Common Objects in Context)	FID-8	29.83	L-Verse
