Description
VL-T5 is a unified framework that learns different vision-and-language tasks within a single architecture under the same language-modeling objective: multimodal conditional text generation. In contrast to prior methods that rely on task-specific architectures and objectives, the model generates its labels as text conditioned on the visual and textual inputs, so one text-generation objective covers all tasks. Task-specific text prefixes tell the model which task to perform.
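A minimal sketch of this prefix-based multimodal generation setup, assuming a Hugging Face transformers T5 backbone. The `visual_proj` layer and the random region features below are illustrative stand-ins for the Faster R-CNN region features the paper uses; the "vqa:" prefix mirrors the task prefixes described above.

```python
import torch
import torch.nn as nn
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Project visual region features into the T5 embedding space.
# 36 regions x 2048-dim features is a common Faster R-CNN setup;
# the random tensor here is a placeholder for real image features.
visual_proj = nn.Linear(2048, model.config.d_model)
visual_feats = torch.randn(1, 36, 2048)
visual_embeds = visual_proj(visual_feats)

# A task-specific text prefix turns the task into text generation.
text = "vqa: what is the man holding?"
input_ids = tokenizer(text, return_tensors="pt").input_ids
text_embeds = model.get_input_embeddings()(input_ids)

# Concatenate visual and text embeddings into one encoder input,
# then decode the answer as free-form text.
inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
with torch.no_grad():
    output_ids = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=10)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because every task is expressed this way, swapping the prefix and target text (e.g., a captioning prefix with a caption as the target) reuses the same architecture and objective without task-specific heads.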
Papers Using This Method
Mixture of Rationale: Multi-Modal Reasoning Mixture for Visual Question Answering (2024-06-03)
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation (2022-10-20)
Webly Supervised Concept Expansion for General Purpose Vision Models (2022-02-04)
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (2021-12-13)
Unifying Vision-and-Language Tasks via Text Generation (2021-02-04)