Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Wonjae Kim, Bokyung Son, Ildoo Kim

2021-02-05 · Cross-Modal Retrieval · Zero-Shot Cross-Modal Retrieval · Multimodal Intent Recognition · Visual Reasoning · Visual Question Answering (VQA) · Object Detection · Image Retrieval

Paper · PDF · Code (official)

Abstract

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP rely heavily on image feature extraction processes, most of which involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). Although disregarded in the literature, this reliance is problematic in terms of both (1) efficiency/speed, in that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, which is upper-bounded by the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet achieves competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.
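
The architectural point of the abstract is easy to see in code: text tokens and image patches are each embedded by nothing more than a lookup table and a single linear projection, then processed jointly by one transformer. Below is a minimal sketch of that idea, not the authors' implementation; the class name, the 40-token text length, and the layer sizes are illustrative assumptions (the 32-pixel patch mirrors ViLT-B/32).

```python
import torch
import torch.nn as nn

class MiniViLT(nn.Module):
    """Sketch of a ViLT-style single-stream encoder (illustrative sizes)."""

    def __init__(self, vocab_size=30522, dim=768, patch=32, img=384,
                 depth=2, max_text_len=40):
        super().__init__()
        n_patches = (img // patch) ** 2          # 144 for 384px / 32px patches
        self.patch = patch
        self.word_emb = nn.Embedding(vocab_size, dim)
        # Convolution-free visual embedding: flatten each patch, project once.
        self.patch_proj = nn.Linear(3 * patch * patch, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        self.img_pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.type_emb = nn.Embedding(2, dim)     # 0 = text token, 1 = image patch
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, token_ids, image):
        b, p = image.size(0), self.patch
        t = self.word_emb(token_ids) + self.text_pos[:, :token_ids.size(1)]
        t = t + self.type_emb.weight[0]
        # (b, 3, H, W) -> (b, n_patches, 3*p*p): pure reshaping, no CNN.
        v = image.unfold(2, p, p).unfold(3, p, p)
        v = v.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * p * p)
        v = self.patch_proj(v) + self.img_pos + self.type_emb.weight[1]
        x = torch.cat([self.cls.expand(b, -1, -1), t, v], dim=1)
        return self.encoder(x)                   # fused multimodal sequence

model = MiniViLT()
out = model(torch.randint(0, 30522, (2, 40)), torch.randn(2, 3, 384, 384))
print(out.shape)  # torch.Size([2, 185, 768]) -> [CLS] + 40 text + 144 patches
```

Everything image-specific is the single patch_proj line: there is no detector and no CNN backbone, which is why feature extraction stops dominating the runtime in this design.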

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | PhotoChat | F1 | 52.4 | ViLT
Reading Comprehension | PhotoChat | Precision | 55.4 | ViLT
Reading Comprehension | PhotoChat | Recall | 58.9 | ViLT
Reading Comprehension | MMDialog | F1 | 55.8 | ViLT
Visual Question Answering (VQA) | VQA v2 test-dev | Accuracy | 71.26 | ViLT-B/32
Visual Reasoning | NLVR2 Dev | Accuracy | 75.7 | ViLT-B/32
Visual Reasoning | NLVR2 Test | Accuracy | 76.13 | ViLT-B/32
Image Retrieval | PhotoChat | R@1 | 11.5 | ViLT
Image Retrieval | PhotoChat | R@5 | 25.6 | ViLT
Image Retrieval | PhotoChat | R@10 | 33.8 | ViLT
Image Retrieval | PhotoChat | Sum(R@1,5,10) | 71 | ViLT
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@1 | 83.5 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@5 | 96.7 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k | Image-to-text R@10 | 98.6 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@1 | 64.4 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@5 | 88.7 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k | Text-to-image R@10 | 93.8 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@1 | 61.5 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@5 | 86.3 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Image-to-text R@10 | 92.7 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@1 | 42.7 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@5 | 72.9 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 | Text-to-image R@10 | 83.1 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Image-to-text R@1 | 73.2 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Image-to-text R@5 | 93.6 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Image-to-text R@10 | 96.5 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Text-to-image R@1 | 55 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Text-to-image R@5 | 82.5 | ViLT-B/32
Image Retrieval with Multi-Modal Query | Flickr30k (zero-shot) | Text-to-image R@10 | 89.8 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Image-to-text R@1 | 56.5 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Image-to-text R@5 | 82.6 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Image-to-text R@10 | 89.6 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Text-to-image R@1 | 40.4 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Text-to-image R@5 | 70 | ViLT-B/32
Image Retrieval with Multi-Modal Query | COCO 2014 (zero-shot) | Text-to-image R@10 | 81.1 | ViLT-B/32
Intent Recognition | PhotoChat | F1 | 52.4 | ViLT
Intent Recognition | PhotoChat | Precision | 55.4 | ViLT
Intent Recognition | PhotoChat | Recall | 58.9 | ViLT
Intent Recognition | MMDialog | F1 | 55.8 | ViLT
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@1 | 83.5 | ViLT-B/32
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@5 | 96.7 | ViLT-B/32
Cross-Modal Information Retrieval | Flickr30k | Image-to-text R@10 | 98.6 | ViLT-B/32
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@1 | 64.4 | ViLT-B/32
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@5 | 88.7 | ViLT-B/32
Cross-Modal Information Retrieval | Flickr30k | Text-to-image R@10 | 93.8 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@1 | 61.5 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@5 | 86.3 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Image-to-text R@10 | 92.7 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@1 | 42.7 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@5 | 72.9 | ViLT-B/32
Cross-Modal Information Retrieval | COCO 2014 | Text-to-image R@10 | 83.1 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Image-to-text R@1 | 83.5 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Image-to-text R@5 | 96.7 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Image-to-text R@10 | 98.6 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Text-to-image R@1 | 64.4 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Text-to-image R@5 | 88.7 | ViLT-B/32
Cross-Modal Retrieval | Flickr30k | Text-to-image R@10 | 93.8 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@1 | 61.5 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@5 | 86.3 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Image-to-text R@10 | 92.7 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@1 | 42.7 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@5 | 72.9 | ViLT-B/32
Cross-Modal Retrieval | COCO 2014 | Text-to-image R@10 | 83.1 | ViLT-B/32
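
All retrieval numbers above are Recall@K: the percentage of queries whose ground-truth match ranks in the top K retrieved candidates, so PhotoChat's Sum(R@1,5,10) is simply the sum of the three recalls. A minimal sketch of the computation follows; the function name and the random score matrix are hypothetical placeholders, not from the paper or the leaderboard.

```python
import numpy as np

def recall_at_k(scores: np.ndarray, gt: np.ndarray, k: int) -> float:
    """Percentage of queries whose ground-truth index is in the top-k."""
    topk = np.argsort(-scores, axis=1)[:, :k]    # top-k candidates, best first
    hits = (topk == gt[:, None]).any(axis=1)     # did any of them match gt?
    return float(hits.mean()) * 100.0

rng = np.random.default_rng(0)
scores = rng.standard_normal((1000, 1000))       # hypothetical similarity matrix
gt = np.arange(1000)                             # candidate i matches query i
r1, r5, r10 = (recall_at_k(scores, gt, k) for k in (1, 5, 10))
print(r1, r5, r10, r1 + r5 + r10)                # last value = "Sum(R@1,5,10)"
```

On the random scores used here the recalls sit near chance (K/1000 as a percentage); with a real model's image-text similarity matrix, the ground-truth pairs rank far higher, giving numbers like those in the table.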

Related Papers

LaViPlan: Language-Guided Visual Path Planning with RLVR (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)