RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

Published: 2023-08-18, ICCV 2023
Tasks: Scene Graph Generation, Human-Object Interaction Detection, Relational Reasoning, Zero-Shot Human-Object Interaction Detection, Object Detection, Graph Generation


Abstract

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, scaling RLIPv1 is challenging because of the slow convergence of the RLIPv1 architecture and the limited availability of existing scene graph data. In this paper, we propose RLIPv2, a fast-converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a purpose-built Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully fine-tuned, few-shot, and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without any fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with 100% of the data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
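
To make the phrase "gated cross-modal fusion" concrete, here is a minimal sketch in plain Python. It is not the ALIF implementation: the abstract does not specify the gate, so the tanh-scaled residual used here is an assumption borrowed from common gated cross-attention designs, and the function names, feature sizes, and inputs are all hypothetical. It omits learned projections, multiple heads, and the sparsified language encoding layers.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, context):
    """Attend from one query vector to a list of context vectors (no learned projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    return [sum(w * ctx[i] for w, ctx in zip(weights, context)) for i in range(len(query))]

def gated_fusion(vision, language, gate):
    """Add language-attended features to each vision vector, scaled by tanh(gate).

    The tanh gate is an assumption for illustration, not the paper's stated design.
    """
    g = math.tanh(gate)
    fused = []
    for v in vision:
        attended = cross_attention(v, language)
        fused.append([vi + g * ai for vi, ai in zip(v, attended)])
    return fused

vision = [[1.0, 0.0], [0.0, 1.0]]     # two hypothetical region features
language = [[0.5, 0.5], [1.0, -1.0]]  # two hypothetical text token features

unchanged = gated_fusion(vision, language, gate=0.0)  # closed gate: no language signal
fused = gated_fusion(vision, language, gate=1.0)      # open gate: language mixed in
```

With the gate at zero the vision features pass through untouched, so a scalar gate of this kind lets a model start from its unimodal behaviour and learn how much language signal to admit at each layer.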

Results

Task	Dataset	Metric	Value	Model
Human-Object Interaction Detection	V-COCO	AP(S1)	72.1	RLIPv2
Human-Object Interaction Detection	V-COCO	AP(S2)	74.1	RLIPv2
Human-Object Interaction Detection	HICO-DET	mAP	45.09	RLIPv2 (Swin-L)
