
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao

2021-01-02 | CVPR 2021 | Tasks: Image-text matching, Image Captioning, Visual Reasoning, Object Detection

Abstract

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.
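The data flow described in the abstract, where detector region features are fed into a Transformer-based fusion model alongside text, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `linear` and `build_fusion_input`, the linear projection of region features, and the tiny dimensions are hypothetical and are not taken from the paper or its released code. The sketch only shows how two modalities might be brought to a common width and joined into one input sequence.

```python
import random


def linear(vec, weight, bias):
    """Apply y = W x + b using plain Python lists."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def build_fusion_input(text_embeddings, region_features, weight, bias):
    """Project each region feature to the text width, then append to the text sequence."""
    projected = [linear(feat, weight, bias) for feat in region_features]
    return text_embeddings + projected


# Toy sizes chosen only for illustration (assumption: not the paper's dimensions).
HIDDEN, REGION_DIM = 4, 6
rng = random.Random(0)

# Hypothetical projection parameters mapping region features to the hidden width.
weight = [[rng.uniform(-1, 1) for _ in range(REGION_DIM)] for _ in range(HIDDEN)]
bias = [0.0] * HIDDEN

# Three text-token embeddings and two detected-region feature vectors.
text_embeddings = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(3)]
region_features = [[rng.uniform(-1, 1) for _ in range(REGION_DIM)] for _ in range(2)]

sequence = build_fusion_input(text_embeddings, region_features, weight, bias)
print(len(sequence), len(sequence[0]))  # sequence length and per-position width
```

In this sketch the fused sequence has one position per text token plus one per detected region, all of the same width, which is the shape a Transformer encoder would consume. A richer detector changes only the region side of the sequence, which is consistent with the abstract's point that the visual features can be improved independently of the fusion model.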

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Question Answering (VQA) | GQA Test2019 | Accuracy | 64.65 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Binary | 82.63 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Consistency | 94.35 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Distribution | 4.72 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Open | 48.77 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Plausibility | 84.98 | Single Model |
| Visual Question Answering (VQA) | GQA Test2019 | Validity | 96.62 | Single Model |
| Visual Question Answering (VQA) | VQA v2 test-std | number | 62.55 | MSR + MS Cog. Svcs., X10 models |
| Visual Question Answering (VQA) | VQA v2 test-std | other | 67.87 | MSR + MS Cog. Svcs., X10 models |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 77.45 | MSR + MS Cog. Svcs., X10 models |
| Visual Question Answering (VQA) | VQA v2 test-std | yes/no | 92.38 | MSR + MS Cog. Svcs., X10 models |
| Visual Question Answering (VQA) | VQA v2 test-std | number | 61.5 | MSR + MS Cog. Svcs. |
| Visual Question Answering (VQA) | VQA v2 test-std | other | 66.68 | MSR + MS Cog. Svcs. |
| Visual Question Answering (VQA) | VQA v2 test-std | overall | 76.63 | MSR + MS Cog. Svcs. |
| Visual Question Answering (VQA) | VQA v2 test-std | yes/no | 92.04 | MSR + MS Cog. Svcs. |
| Image Captioning | nocaps near-domain | B1 | 82.77 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | B2 | 66.94 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | B3 | 47.02 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | B4 | 27.97 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | CIDEr | 95.16 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | METEOR | 28.24 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | ROUGE-L | 57.95 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps near-domain | SPICE | 13.36 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | B1 | 81.59 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | B2 | 65.15 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | B3 | 45.04 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | B4 | 26.15 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | CIDEr | 92.46 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | METEOR | 27.57 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | ROUGE-L | 56.96 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps entire | SPICE | 13.07 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps-val-out-domain | CIDEr | 88.3 | VinVL |
| Image Captioning | nocaps-val-out-domain | SPICE | 12.1 | VinVL |
| Image Captioning | nocaps-val-near-domain | CIDEr | 96.1 | VinVL |
| Image Captioning | nocaps-val-near-domain | SPICE | 13.8 | VinVL |
| Image Captioning | COCO Captions | BLEU-4 | 41 | VinVL |
| Image Captioning | COCO Captions | CIDEr | 140.9 | VinVL |
| Image Captioning | COCO Captions | METEOR | 31.1 | VinVL |
| Image Captioning | COCO Captions | SPICE | 25.2 | VinVL |
| Image Captioning | nocaps out-of-domain | B1 | 75.78 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | B2 | 56.1 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | B3 | 34.02 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | B4 | 15.86 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | CIDEr | 78.01 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | METEOR | 23.55 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | ROUGE-L | 51.99 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps out-of-domain | SPICE | 11.48 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps-val-overall | CIDEr | 95.5 | VinVL |
| Image Captioning | nocaps-val-overall | SPICE | 13.5 | VinVL |
| Image Captioning | nocaps in-domain | B1 | 83.24 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | B2 | 68.04 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | B3 | 49.68 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | B4 | 30.62 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | CIDEr | 97.99 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | METEOR | 29.51 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | ROUGE-L | 58.54 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps in-domain | SPICE | 13.63 | VinVL (Microsoft Cognitive Services + MSR) |
| Image Captioning | nocaps-val-in-domain | CIDEr | 103.1 | VinVL |
| Image Captioning | nocaps-val-in-domain | SPICE | 14.2 | VinVL |
| Image Retrieval with Multi-Modal Query | CommercialAdsDataset | ADD(S) AUC | 88.56 | VinVL |
| Cross-Modal Information Retrieval | CommercialAdsDataset | ADD(S) AUC | 88.56 | VinVL |
| Cross-Modal Retrieval | CommercialAdsDataset | ADD(S) AUC | 88.56 | VinVL |

Related Papers

LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)
Decoupled PROB: Decoupled Query Initialization Tasks and Objectness-Class Learning for Open World Object Detection (2025-07-17)
Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos (2025-07-16)
Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios (2025-07-16)
Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning (2025-07-15)