Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

Junyu Lu, Dixiang Zhang, Songxin Zhang, Zejian Xie, Zhuoyang Song, Cong Lin, Jiaxing Zhang, BingYi Jing, Pingjian Zhang

2023-12-08

Referring Expression Comprehension, Referring Expression Segmentation, Semantic Segmentation, Image Captioning, Visual Question Answering (VQA), Object Detection

Abstract

Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. However, the absence of fine-grained visual object detection hinders the model's understanding of image details, leading to irreparable visual hallucinations and factual errors. In this paper, we propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration. Building on the foundation of BLIP-2, Lyrics infuses local visual features into the Querying Transformer; these features are extracted by a visual refiner that includes image tagging, object detection and semantic segmentation modules. On the text side, the language inputs are augmented with the bounding boxes and tags derived from the visual refiner. We further introduce a two-stage training scheme, in which the pre-training stage bridges the modality gap through explicit and comprehensive vision-language alignment targets. During the instruction fine-tuning stage, we introduce semantic-aware visual feature extraction, a crucial method that enables the model to extract informative features from concrete visual objects. Our approach achieves robust performance on 13 datasets across various vision-language tasks, and demonstrates promising multi-modal understanding, perception and conversation capabilities in 11 scenario-based benchmark toolkits.
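As a rough illustration of the text-side idea only, the sketch below shows one way tags and bounding boxes from a detector could be serialized into a language input. It is not the Lyrics implementation: the abstract does not specify the format, so the class `DetectedObject`, the functions `normalize_box` and `build_text_input`, the 1000-bin coordinate scaling, and the "Objects: ... / Question: ..." template are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """One object reported by a hypothetical visual refiner (illustrative only)."""
    tag: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def normalize_box(box, width, height, bins=1000):
    """Scale pixel coordinates to integers in [0, bins - 1].

    The bin count is an assumption; the source does not state a coordinate format.
    """
    def scale(value, size):
        return min(bins - 1, max(0, int(value / size * bins)))

    x0, y0, x1, y1 = box
    return (scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height))


def build_text_input(question, objects, width, height):
    """Prefix a question with tags and normalized boxes (hypothetical template)."""
    parts = []
    for obj in objects:
        x0, y0, x1, y1 = normalize_box(obj.box, width, height)
        parts.append(f"{obj.tag} [{x0}, {y0}, {x1}, {y1}]")
    context = "; ".join(parts) if parts else "none"
    return f"Objects: {context}\nQuestion: {question}"


if __name__ == "__main__":
    detections = [DetectedObject("dog", (0, 0, 320, 240))]
    print(build_text_input("What is the animal?", detections, width=640, height=480))
```

Normalizing coordinates to a fixed integer range keeps the serialized boxes independent of image resolution, which is a common convention but is only one possible design here.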

Results

Task	Dataset	Metric	Value	Model
Visual Question Answering (VQA)	OK-VQA	Accuracy	58.2	Lyrics
Visual Question Answering (VQA)	GQA test-dev	Accuracy	62.4	Lyrics
Visual Question Answering (VQA)	VQA v2 test-dev	Accuracy	81.2	Lyrics
Image Captioning	nocaps entire	CIDEr	126.8	Lyrics
Image Captioning	COCO (Common Objects in Context)	CIDEr	121.1	Lyrics
