Description
ViLT is a minimal vision-and-language pre-training transformer in which the processing of visual inputs is simplified to the same convolution-free manner in which text inputs are processed: the image is split into patches that are linearly embedded, with no CNN backbone or region supervision. As a result, the modality-specific components of ViLT require less computation than the transformer component that models multimodal interactions. The model is pre-trained on three objectives: image-text matching, masked language modeling, and word patch alignment.
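As an illustration of the image-text matching objective at inference time, the sketch below scores how well candidate captions match an image. It assumes the Hugging Face transformers implementation of ViLT and the dandelin/vilt-b32-finetuned-coco retrieval checkpoint; the image URL and captions are illustrative choices, not part of the original description.

```python
# Minimal sketch of ViLT image-text matching, assuming the Hugging Face
# `transformers` ViLT classes and the dandelin/vilt-b32-finetuned-coco
# checkpoint (fine-tuned for image-and-text retrieval).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForImageAndTextRetrieval

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")

# Any RGB image works; this COCO URL is an example choice.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "two cats sleeping on a couch",
    "a plate of spaghetti on a table",
]

# ViLT embeds image patches with a linear projection (no CNN backbone),
# concatenates them with the text tokens, and passes the joint sequence
# through a single transformer; a higher logit means a better match.
for caption in captions:
    encoding = processor(image, caption, return_tensors="pt")
    outputs = model(**encoding)
    print(f"{caption!r}: {outputs.logits[0, 0].item():.3f}")
```

Because there is no convolutional visual backbone, the per-pair cost is dominated by the shared transformer, which is the efficiency argument made above.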
Papers Using This Method
Seeing More with Less: Human-like Representations in Vision Models (2025-01-01)
Visual Robustness Benchmark for Visual Question Answering (VQA) (2024-07-03)
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis (2023-09-21)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
Visual Spatial Reasoning (2022-04-30)
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (2021-02-05)