Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

Published: 2022-05-27
Tasks: Question Answering, Scene Text Recognition, Image Classification, Image to text, Video Captioning, Image Captioning, Visual Question Answering (VQA), Language Modelling, Optical Character Recognition (OCR)
Paper · PDF · Code (official)

Abstract

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Codes are released at \url{https://github.com/microsoft/GenerativeImage2Text}.
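The abstract's central design point is that a single image encoder feeds a single text decoder, and everything (captioning, VQA) is trained with one language-modeling loss over the text tokens conditioned on the visual tokens. A minimal NumPy sketch of that objective, with toy dimensions, random weights, and a linear stand-in for the attention decoder (none of this is the actual GIT model), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; the real GIT pairs a large ViT-style
# image encoder with a transformer text decoder.
d_model, vocab = 8, 16

def image_encoder(image):
    """Stand-in encoder: map an image to a sequence of visual tokens."""
    return rng.standard_normal((4, d_model))  # 4 fake "patch" embeddings

E = rng.standard_normal((vocab, d_model))      # token embedding table
W_out = rng.standard_normal((d_model, vocab))  # readout to vocabulary

def decoder_logits(visual_tokens, text_tokens):
    """Stand-in decoder: visual tokens are simply prepended to the text
    tokens, and a linear readout gives per-position vocabulary logits
    (the real decoder uses self-attention over the joint sequence)."""
    seq = np.concatenate([visual_tokens, E[text_tokens]], axis=0)
    return seq @ W_out

def lm_loss(logits, targets):
    """Plain language-modeling cross-entropy."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -np.mean(logp[np.arange(len(targets)), targets])

caption = np.array([3, 7, 1])  # toy caption token ids
vis = image_encoder(None)
logits = decoder_logits(vis, caption)

# The logit at the last visual position predicts the first caption token,
# and each caption position predicts the next; only text is supervised.
pred = logits[len(vis) - 1 : len(vis) - 1 + len(caption)]
loss = lm_loss(pred, caption)
print("toy LM loss:", float(loss))
```

The same setup covers the generation-based classification and VQA schemes the abstract mentions: the question (or nothing) becomes the text prefix, and the answer or class name is simply generated by the decoder.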

Results

Task | Dataset | Metric | Value | Model
Visual Question Answering (VQA) | MSVD-QA | Accuracy | 0.568 | GIT
Image Captioning | nocaps near-domain | BLEU-1 | 88.9 | GIT2, Single Model
Image Captioning | nocaps near-domain | BLEU-2 | 75.86 | GIT2, Single Model
Image Captioning | nocaps near-domain | BLEU-3 | 58.9 | GIT2, Single Model
Image Captioning | nocaps near-domain | BLEU-4 | 38.95 | GIT2, Single Model
Image Captioning | nocaps near-domain | CIDEr | 125.51 | GIT2, Single Model
Image Captioning | nocaps near-domain | METEOR | 32.95 | GIT2, Single Model
Image Captioning | nocaps near-domain | ROUGE-L | 63.66 | GIT2, Single Model
Image Captioning | nocaps near-domain | SPICE | 16.11 | GIT2, Single Model
Image Captioning | nocaps near-domain | BLEU-1 | 88.56 | GIT, Single Model
Image Captioning | nocaps near-domain | BLEU-2 | 75.48 | GIT, Single Model
Image Captioning | nocaps near-domain | BLEU-3 | 58.46 | GIT, Single Model
Image Captioning | nocaps near-domain | BLEU-4 | 38.44 | GIT, Single Model
Image Captioning | nocaps near-domain | CIDEr | 123.92 | GIT, Single Model
Image Captioning | nocaps near-domain | METEOR | 32.86 | GIT, Single Model
Image Captioning | nocaps near-domain | ROUGE-L | 63.5 | GIT, Single Model
Image Captioning | nocaps near-domain | SPICE | 15.96 | GIT, Single Model
Image Captioning | nocaps entire | BLEU-1 | 88.1 | GIT, Single Model
Image Captioning | nocaps entire | BLEU-2 | 74.81 | GIT, Single Model
Image Captioning | nocaps entire | BLEU-3 | 57.68 | GIT, Single Model
Image Captioning | nocaps entire | BLEU-4 | 37.35 | GIT, Single Model
Image Captioning | nocaps entire | CIDEr | 123.39 | GIT, Single Model
Image Captioning | nocaps entire | METEOR | 32.5 | GIT, Single Model
Image Captioning | nocaps entire | ROUGE-L | 63.12 | GIT, Single Model
Image Captioning | nocaps entire | SPICE | 15.94 | GIT, Single Model
Image Captioning | COCO Captions | BLEU-4 | 44.1 | GIT
Image Captioning | COCO Captions | CIDEr | 151.1 | GIT
Image Captioning | COCO Captions | METEOR | 32.2 | GIT
Image Captioning | COCO Captions | SPICE | 26.3 | GIT
Image Captioning | nocaps out-of-domain | BLEU-1 | 86.28 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | BLEU-2 | 71.15 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | BLEU-3 | 52.36 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | BLEU-4 | 30.15 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | CIDEr | 122.27 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | METEOR | 30.15 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | ROUGE-L | 60.91 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | SPICE | 15.62 | GIT2, Single Model
Image Captioning | nocaps out-of-domain | BLEU-1 | 85.99 | GIT, Single Model
Image Captioning | nocaps out-of-domain | BLEU-2 | 71.28 | GIT, Single Model
Image Captioning | nocaps out-of-domain | BLEU-3 | 52.66 | GIT, Single Model
Image Captioning | nocaps out-of-domain | BLEU-4 | 30.04 | GIT, Single Model
Image Captioning | nocaps out-of-domain | CIDEr | 122.04 | GIT, Single Model
Image Captioning | nocaps out-of-domain | METEOR | 30.45 | GIT, Single Model
Image Captioning | nocaps out-of-domain | ROUGE-L | 60.96 | GIT, Single Model
Image Captioning | nocaps out-of-domain | SPICE | 15.7 | GIT, Single Model
Image Captioning | nocaps-XD in-domain | BLEU-1 | 88.86 | GIT2
Image Captioning | nocaps-XD in-domain | BLEU-2 | 75.86 | GIT2
Image Captioning | nocaps-XD in-domain | BLEU-3 | 59.94 | GIT2
Image Captioning | nocaps-XD in-domain | BLEU-4 | 41.1 | GIT2
Image Captioning | nocaps-XD in-domain | CIDEr | 124.18 | GIT2
Image Captioning | nocaps-XD in-domain | METEOR | 33.83 | GIT2
Image Captioning | nocaps-XD in-domain | ROUGE-L | 63.82 | GIT2
Image Captioning | nocaps-XD in-domain | SPICE | 16.36 | GIT2
Image Captioning | nocaps-XD in-domain | BLEU-1 | 88.55 | GIT
Image Captioning | nocaps-XD in-domain | BLEU-2 | 76.1 | GIT
Image Captioning | nocaps-XD in-domain | BLEU-3 | 60.53 | GIT
Image Captioning | nocaps-XD in-domain | BLEU-4 | 41.65 | GIT
Image Captioning | nocaps-XD in-domain | CIDEr | 122.4 | GIT
Image Captioning | nocaps-XD in-domain | METEOR | 33.41 | GIT
Image Captioning | nocaps-XD in-domain | ROUGE-L | 64.02 | GIT
Image Captioning | nocaps-XD in-domain | SPICE | 16.18 | GIT
Image Captioning | nocaps in-domain | BLEU-1 | 88.86 | GIT2, Single Model
Image Captioning | nocaps in-domain | BLEU-2 | 75.86 | GIT2, Single Model
Image Captioning | nocaps in-domain | BLEU-3 | 59.94 | GIT2, Single Model
Image Captioning | nocaps in-domain | BLEU-4 | 41.1 | GIT2, Single Model
Image Captioning | nocaps in-domain | CIDEr | 124.18 | GIT2, Single Model
Image Captioning | nocaps in-domain | METEOR | 33.83 | GIT2, Single Model
Image Captioning | nocaps in-domain | ROUGE-L | 63.82 | GIT2, Single Model
Image Captioning | nocaps in-domain | SPICE | 16.36 | GIT2, Single Model
Image Captioning | nocaps in-domain | BLEU-1 | 88.55 | GIT, Single Model
Image Captioning | nocaps in-domain | BLEU-2 | 76.1 | GIT, Single Model
Image Captioning | nocaps in-domain | BLEU-3 | 60.53 | GIT, Single Model
Image Captioning | nocaps in-domain | BLEU-4 | 41.65 | GIT, Single Model
Image Captioning | nocaps in-domain | CIDEr | 122.4 | GIT, Single Model
Image Captioning | nocaps in-domain | METEOR | 33.41 | GIT, Single Model
Image Captioning | nocaps in-domain | ROUGE-L | 64.02 | GIT, Single Model
Image Captioning | nocaps in-domain | SPICE | 16.18 | GIT, Single Model
Image Captioning | nocaps-XD near-domain | BLEU-1 | 88.9 | GIT2
Image Captioning | nocaps-XD near-domain | BLEU-2 | 75.86 | GIT2
Image Captioning | nocaps-XD near-domain | BLEU-3 | 58.9 | GIT2
Image Captioning | nocaps-XD near-domain | BLEU-4 | 38.95 | GIT2
Image Captioning | nocaps-XD near-domain | CIDEr | 125.51 | GIT2
Image Captioning | nocaps-XD near-domain | METEOR | 32.95 | GIT2
Image Captioning | nocaps-XD near-domain | ROUGE-L | 63.66 | GIT2
Image Captioning | nocaps-XD near-domain | SPICE | 16.11 | GIT2
Image Captioning | nocaps-XD near-domain | BLEU-1 | 88.56 | GIT
Image Captioning | nocaps-XD near-domain | BLEU-2 | 75.48 | GIT
Image Captioning | nocaps-XD near-domain | BLEU-3 | 58.46 | GIT
Image Captioning | nocaps-XD near-domain | BLEU-4 | 38.44 | GIT
Image Captioning | nocaps-XD near-domain | CIDEr | 123.92 | GIT
Image Captioning | nocaps-XD near-domain | METEOR | 32.86 | GIT
Image Captioning | nocaps-XD near-domain | ROUGE-L | 63.5 | GIT
Image Captioning | nocaps-XD near-domain | SPICE | 15.96 | GIT
Image Captioning | nocaps-XD entire | BLEU-1 | 88.43 | GIT2
Image Captioning | nocaps-XD entire | BLEU-2 | 75.02 | GIT2
Image Captioning | nocaps-XD entire | BLEU-3 | 57.87 | GIT2
Image Captioning | nocaps-XD entire | BLEU-4 | 37.65 | GIT2
Image Captioning | nocaps-XD entire | CIDEr | 124.77 | GIT2
Image Captioning | nocaps-XD entire | METEOR | 32.56 | GIT2
Image Captioning | nocaps-XD entire | ROUGE-L | 63.19 | GIT2
Image Captioning | nocaps-XD entire | SPICE | 16.06 | GIT2
Image Captioning | nocaps-XD entire | BLEU-1 | 88.1 | GIT
Image Captioning | nocaps-XD entire | BLEU-2 | 74.81 | GIT
Image Captioning | nocaps-XD entire | BLEU-3 | 57.68 | GIT
Image Captioning | nocaps-XD entire | BLEU-4 | 37.35 | GIT
Image Captioning | nocaps-XD entire | CIDEr | 123.39 | GIT
Image Captioning | nocaps-XD entire | METEOR | 32.5 | GIT
Image Captioning | nocaps-XD entire | ROUGE-L | 63.12 | GIT
Image Captioning | nocaps-XD entire | SPICE | 15.94 | GIT
Image Captioning | nocaps-XD out-of-domain | BLEU-1 | 86.28 | GIT2
Image Captioning | nocaps-XD out-of-domain | BLEU-2 | 71.15 | GIT2
Image Captioning | nocaps-XD out-of-domain | BLEU-3 | 52.36 | GIT2
Image Captioning | nocaps-XD out-of-domain | BLEU-4 | 30.15 | GIT2
Image Captioning | nocaps-XD out-of-domain | CIDEr | 122.27 | GIT2
Image Captioning | nocaps-XD out-of-domain | METEOR | 30.15 | GIT2
Image Captioning | nocaps-XD out-of-domain | ROUGE-L | 60.91 | GIT2
Image Captioning | nocaps-XD out-of-domain | SPICE | 15.62 | GIT2
Image Captioning | nocaps-XD out-of-domain | BLEU-1 | 85.99 | GIT
Image Captioning | nocaps-XD out-of-domain | BLEU-2 | 71.28 | GIT
Image Captioning | nocaps-XD out-of-domain | BLEU-3 | 52.66 | GIT
Image Captioning | nocaps-XD out-of-domain | BLEU-4 | 30.04 | GIT
Image Captioning | nocaps-XD out-of-domain | CIDEr | 122.04 | GIT
Image Captioning | nocaps-XD out-of-domain | METEOR | 30.45 | GIT
Image Captioning | nocaps-XD out-of-domain | ROUGE-L | 60.96 | GIT
Image Captioning | nocaps-XD out-of-domain | SPICE | 15.7 | GIT
Video Captioning | MSR-VTT | BLEU-4 | 54.8 | GIT2
Video Captioning | MSR-VTT | CIDEr | 75.9 | GIT2
Video Captioning | MSR-VTT | GS | 201.6 | GIT2
Video Captioning | MSR-VTT | METEOR | 33.1 | GIT2
Video Captioning | MSR-VTT | ROUGE-L | 68.2 | GIT2

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)