VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao

2021-01-02 · CVPR 2021

Tasks: Image-text matching, Image Captioning, Visual Reasoning, Object Detection


Abstract

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (Li et al., 2020), and utilize an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to the public.
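The abstract describes feeding detector-produced region features into a Transformer-based fusion model. The snippet below is a minimal sketch of one common way to do that: project each region feature to the text embedding size, then concatenate the text and region vectors into a single input sequence. It is an assumption-laden illustration, not the authors' implementation; the function names (`linear`, `build_fusion_input`) and all dimensions are hypothetical toy values, and the exact OSCAR input layout may differ.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight has shape out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def build_fusion_input(text_embeddings, region_features, weight, bias):
    """Project each region feature to the text hidden size, then concatenate
    text and region vectors into one sequence for a Transformer encoder."""
    projected = [linear(r, weight, bias) for r in region_features]
    return text_embeddings + projected

# Toy sizes (assumptions for illustration, not the paper's settings).
random.seed(0)
HIDDEN, REGION_DIM, NUM_TOKENS, NUM_REGIONS = 4, 12, 5, 3

# Stand-ins for token embeddings and detector region features.
text = [[random.random() for _ in range(HIDDEN)] for _ in range(NUM_TOKENS)]
regions = [[random.random() for _ in range(REGION_DIM)] for _ in range(NUM_REGIONS)]

# Randomly initialised projection; in practice this would be learned.
W = [[random.gauss(0, 0.02) for _ in range(REGION_DIM)] for _ in range(HIDDEN)]
b = [0.0] * HIDDEN

sequence = build_fusion_input(text, regions, W, b)
print(len(sequence), len(sequence[0]))  # sequence length, hidden size
```

The resulting sequence has one vector per text token followed by one per image region, all of the same width, which is the shape a standard Transformer encoder expects.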

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	GQA Test2019	Accuracy	64.65	Single Model
Visual Question Answering (VQA)	GQA Test2019	Binary	82.63	Single Model
Visual Question Answering (VQA)	GQA Test2019	Consistency	94.35	Single Model
Visual Question Answering (VQA)	GQA Test2019	Distribution	4.72	Single Model
Visual Question Answering (VQA)	GQA Test2019	Open	48.77	Single Model
Visual Question Answering (VQA)	GQA Test2019	Plausibility	84.98	Single Model
Visual Question Answering (VQA)	GQA Test2019	Validity	96.62	Single Model
Visual Question Answering (VQA)	VQA v2 test-std	number	62.55	MSR + MS Cog. Svcs., X10 models
Visual Question Answering (VQA)	VQA v2 test-std	other	67.87	MSR + MS Cog. Svcs., X10 models
Visual Question Answering (VQA)	VQA v2 test-std	overall	77.45	MSR + MS Cog. Svcs., X10 models
Visual Question Answering (VQA)	VQA v2 test-std	yes/no	92.38	MSR + MS Cog. Svcs., X10 models
Visual Question Answering (VQA)	VQA v2 test-std	number	61.5	MSR + MS Cog. Svcs.
Visual Question Answering (VQA)	VQA v2 test-std	other	66.68	MSR + MS Cog. Svcs.
Visual Question Answering (VQA)	VQA v2 test-std	overall	76.63	MSR + MS Cog. Svcs.
Visual Question Answering (VQA)	VQA v2 test-std	yes/no	92.04	MSR + MS Cog. Svcs.
Image Captioning	nocaps near-domain	B1	82.77	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	B2	66.94	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	B3	47.02	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	B4	27.97	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	CIDEr	95.16	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	METEOR	28.24	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	ROUGE-L	57.95	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps near-domain	SPICE	13.36	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	B1	81.59	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	B2	65.15	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	B3	45.04	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	B4	26.15	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	CIDEr	92.46	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	METEOR	27.57	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	ROUGE-L	56.96	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps entire	SPICE	13.07	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps-val-out-domain	CIDEr	88.3	VinVL
Image Captioning	nocaps-val-out-domain	SPICE	12.1	VinVL
Image Captioning	nocaps-val-near-domain	CIDEr	96.1	VinVL
Image Captioning	nocaps-val-near-domain	SPICE	13.8	VinVL
Image Captioning	COCO Captions	BLEU-4	41	VinVL
Image Captioning	COCO Captions	CIDEr	140.9	VinVL
Image Captioning	COCO Captions	METEOR	31.1	VinVL
Image Captioning	COCO Captions	SPICE	25.2	VinVL
Image Captioning	nocaps out-of-domain	B1	75.78	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	B2	56.1	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	B3	34.02	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	B4	15.86	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	CIDEr	78.01	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	METEOR	23.55	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	ROUGE-L	51.99	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps out-of-domain	SPICE	11.48	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps-val-overall	CIDEr	95.5	VinVL
Image Captioning	nocaps-val-overall	SPICE	13.5	VinVL
Image Captioning	nocaps in-domain	B1	83.24	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	B2	68.04	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	B3	49.68	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	B4	30.62	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	CIDEr	97.99	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	METEOR	29.51	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	ROUGE-L	58.54	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps in-domain	SPICE	13.63	VinVL (Microsoft Cognitive Services + MSR)
Image Captioning	nocaps-val-in-domain	CIDEr	103.1	VinVL
Image Captioning	nocaps-val-in-domain	SPICE	14.2	VinVL
Image Retrieval with Multi-Modal Query	CommercialAdsDataset	ADD(S) AUC	88.56	VinVL
Cross-Modal Information Retrieval	CommercialAdsDataset	ADD(S) AUC	88.56	VinVL
Cross-Modal Retrieval	CommercialAdsDataset	ADD(S) AUC	88.56	VinVL
