TaeHoon Kim, Gwangmo Song, Sihaeng Lee, Sangyun Kim, Yewon Seo, Soonyoung Lee, Seung Hwan Kim, Honglak Lee, Kyunghoon Bae
Far beyond learning long-range interactions of natural language, transformers are becoming the de facto standard for many vision tasks thanks to their power and scalability. Especially in cross-modal tasks between image and text, vector quantized variational autoencoders (VQ-VAEs) are widely used to convert a raw RGB image into a sequence of feature vectors. To better leverage the correlation between image and text, we propose L-Verse, a novel architecture consisting of a feature-augmented variational autoencoder (AugVAE) and a bidirectional auto-regressive transformer (BiART) for image-to-text and text-to-image generation. Our AugVAE achieves state-of-the-art reconstruction performance on the ImageNet1K validation set and is robust to unseen images in the wild. Unlike other models, BiART can distinguish between an image (or text) used as a conditional reference and one used as a generation target. L-Verse can be used directly for image-to-text or text-to-image generation without any finetuning or extra object detection framework. In quantitative and qualitative experiments, L-Verse shows impressive results against previous methods in both image-to-text and text-to-image generation on MS-COCO Captions. We furthermore assess the scalability of the L-Verse architecture on Conceptual Captions and present initial results of bidirectional vision-language representation learning in the general domain.
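To illustrate how a VQ-VAE turns an image into a sequence, the sketch below shows only the generic vector-quantization step: each feature vector in a grid is replaced by the index of its nearest codebook entry, and the grid is flattened in raster order. It is a minimal toy under stated assumptions, not the AugVAE implementation. The encoder that would produce the feature grid is omitted, and the names `quantize`, `image_to_tokens`, `codebook`, and `feature_grid`, along with all numeric values, are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def image_to_tokens(
    feature_grid: Sequence[Sequence[Vector]], codebook: Sequence[Vector]
) -> List[int]:
    """Flatten an H x W grid of feature vectors in raster order, then quantize."""
    flat = [vec for row in feature_grid for vec in row]
    return quantize(flat, codebook)


# Hypothetical 4-entry codebook of 2-D vectors (real codebooks are far larger).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical 2 x 2 grid of encoder outputs (the encoder itself is omitted).
feature_grid = [
    [(0.1, -0.1), (0.9, 0.2)],
    [(0.2, 0.8), (1.1, 0.9)],
]

tokens = image_to_tokens(feature_grid, codebook)
print(tokens)  # [0, 1, 2, 3]
```

The resulting list of integer indices is the kind of discrete sequence an auto-regressive transformer can model alongside text tokens.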
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Image Generation | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC |
| Image Generation | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC |
| Image Generation | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC |
| Image Generation | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC |
| Image Generation | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC |
| Image Generation | COCO (Common Objects in Context) | FID | 45.8 | L-Verse |
| Image Generation | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse |
| Image Generation | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse |
| Image Generation | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse |
| Image Generation | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse |
| Image Captioning | COCO Captions | BLEU-4 | 39.9 | L-Verse |
| Image Captioning | COCO Captions | METEOR | 31.4 | L-Verse |
| Image Captioning | COCO Captions | ROUGE-L | 60.4 | L-Verse |
| Image Captioning | COCO Captions | SPICE | 23.3 | L-Verse |
| Image Reconstruction | ImageNet 256x256 | FID | 1.04 | AugVAE-ML |
| Image Reconstruction | ImageNet 256x256 | FID | 3.28 | AugVAE-SL |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 37.2 | L-Verse-CC |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 31.6 | L-Verse-CC |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 25.7 | L-Verse-CC |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 21.4 | L-Verse-CC |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 21.1 | L-Verse-CC |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID | 45.8 | L-Verse |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-1 | 41.9 | L-Verse |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-2 | 35.5 | L-Verse |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-4 | 30.2 | L-Verse |
| Text-to-Image Generation | COCO (Common Objects in Context) | FID-8 | 29.83 | L-Verse |