Description
ViLT is a minimal vision-and-language pre-training transformer in which the processing of visual inputs is simplified to the same convolution-free manner in which text inputs are processed: the image is split into patches that are linearly embedded, with no CNN backbone or region supervision. As a result, the modality-specific components of ViLT require less computation than the transformer component that models multimodal interactions. The model is pre-trained on three objectives: image-text matching, masked language modeling, and word patch alignment.
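As an illustration of the image-text matching objective at inference time, the sketch below scores how well candidate captions match an image. It assumes the Hugging Face transformers implementation of ViLT and the dandelin/vilt-b32-finetuned-coco retrieval checkpoint; the image URL and captions are illustrative choices, not part of the original description.

```python
# Minimal sketch of ViLT image-text matching, assuming the Hugging Face
# `transformers` ViLT classes and the dandelin/vilt-b32-finetuned-coco
# checkpoint (fine-tuned for image-and-text retrieval).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForImageAndTextRetrieval

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained("dandelin/vilt-b32-finetuned-coco")

# Any RGB image works; this COCO URL is an example choice.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = [
    "two cats sleeping on a couch",
    "a plate of spaghetti on a table",
]

# ViLT embeds image patches with a linear projection (no CNN backbone),
# concatenates them with the text tokens, and passes the joint sequence
# through a single transformer; a higher logit means a better match.
for caption in captions:
    encoding = processor(image, caption, return_tensors="pt")
    outputs = model(**encoding)
    print(f"{caption!r}: {outputs.logits[0, 0].item():.3f}")
```

Because there is no convolutional visual backbone, the per-pair cost is dominated by the shared transformer, which is the efficiency argument made above.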
Papers Using This Method
Seeing More with Less: Human-like Representations in Vision Models (2025-01-01)
Visual Robustness Benchmark for Visual Question Answering (VQA) (2024-07-03)
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis (2023-09-21)
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (2022-07-01)
Visual Spatial Reasoning (2022-04-30)
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision (2021-02-05)