HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang

2024-11-26

Panoptic Segmentation, Open Vocabulary Semantic Segmentation, Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Object Segmentation, World Knowledge, Large Language Model


Abstract

This paper addresses universal segmentation for image and video perception by leveraging the strong reasoning ability of Visual Large Language Models (VLLMs). Despite significant progress, current unified segmentation methods remain limited in adapting to both image and video scenarios and in handling complex reasoning segmentation, which makes it difficult for them to follow diverse, challenging instructions and to accurately understand fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, covering both generic segmentation tasks and more complex reasoning perception tasks that require strong reasoning abilities and world knowledge. In addition, to fully leverage the recognition capabilities of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. With its temporal adapter, HyperSeg also achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our approach on universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

Results

Task	Dataset	Metric	Value	Model
Referring Expression Segmentation	RefCOCO val	Overall IoU	84.8	HyperSeg
Referring Expression Segmentation	RefCOCO testA	Overall IoU	85.7	HyperSeg
Referring Expression Segmentation	RefCOCO testB	Overall IoU	83.4	HyperSeg
Referring Expression Segmentation	RefCOCO+ val	Overall IoU	79.0	HyperSeg
Referring Expression Segmentation	RefCOCO+ testA	Overall IoU	83.5	HyperSeg
Referring Expression Segmentation	RefCOCO+ testB	Overall IoU	75.2	HyperSeg
Referring Expression Segmentation	RefCOCOg val	Overall IoU	79.4	HyperSeg
Referring Expression Segmentation	RefCOCOg test	Overall IoU	78.9	HyperSeg
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	71.2	HyperSeg
Video Object Segmentation	Refer-YouTube-VOS	J&F	68.5	HyperSeg
Semantic Segmentation	COCO (Common Objects in Context)	mIoU	77.2	HyperSeg
Panoptic Segmentation	COCO minival	PQ	61.2	HyperSeg (Swin-B)
Open Vocabulary Semantic Segmentation	PascalVOC-20	mIoU	92.1	HyperSeg
Open Vocabulary Semantic Segmentation	PASCAL Context-59	mIoU	64.6	HyperSeg
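The results table mixes several metrics. As a reading aid, here is a minimal sketch of their standard definitions, assuming binary masks flattened to Python lists of 0/1 values. Overall IoU (the RefCOCO-family metric) sums intersections and unions over the whole split before dividing, so large objects weigh more than they would in a per-sample average. PQ matches predicted and ground-truth segments of one class when their IoU exceeds 0.5 and divides the summed IoU of the matches by |TP| + 0.5|FP| + 0.5|FN|. The function names are hypothetical; this is not HyperSeg's evaluation code or any benchmark toolkit's API, and it omits details such as the crowd and void handling of the official COCO panoptic evaluation. J&F (the average of region similarity J, a mask IoU, and contour accuracy F) is not covered.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical helpers for illustration only. A mask is a binary mask
# flattened to a 1-D sequence of 0/1 values, one per pixel.
Mask = Sequence[int]


def intersection_union(pred: Mask, gt: Mask) -> tuple[int, int]:
    """Return (intersection, union) pixel counts of two same-size masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def overall_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Overall (cumulative) IoU: sum of intersections / sum of unions over a split."""
    if len(preds) != len(gts):
        raise ValueError("need one prediction per ground-truth mask")
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_sample_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Per-sample IoU averaged over samples, shown for contrast with overall IoU."""
    if len(preds) != len(gts):
        raise ValueError("need one prediction per ground-truth mask")
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Two empty masks count as a perfect match here; conventions vary.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious) if ious else 0.0


def panoptic_quality(pred_segments: Sequence[Mask], gt_segments: Sequence[Mask]) -> float:
    """Simplified PQ for one class of one image.

    Segments within an image do not overlap, so a pair with IoU > 0.5 is
    the only possible match for both segments and greedy matching suffices.
    """
    matched_gt: set[int] = set()
    tp = 0
    iou_sum = 0.0
    for pred in pred_segments:
        for j, gt in enumerate(gt_segments):
            if j in matched_gt:
                continue
            inter, union = intersection_union(pred, gt)
            if union and inter / union > 0.5:
                matched_gt.add(j)
                tp += 1
                iou_sum += inter / union
                break
    fp = len(pred_segments) - tp  # unmatched predictions
    fn = len(gt_segments) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Usage: two referring-segmentation samples with 4-pixel masks.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]
print(overall_iou(preds, gts))      # 2 / 6, about 0.333
print(mean_sample_iou(preds, gts))  # (0.5 + 0.25) / 2 = 0.375
```

In a full evaluation, PQ is computed per class and then averaged over classes, and semantic-segmentation mIoU likewise averages per-class IoUs accumulated over the whole dataset rather than per-sample IoUs.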
