VISA: Reasoning Video Object Segmentation via Large Language Models

Cilin Yan, Haochen Wang, Shilin Yan, XiaoLong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves

2024-07-16

Referring Video Object Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, World Knowledge, Video Semantic Segmentation


Abstract

Existing Video Object Segmentation (VOS) methods rely on explicit user instructions, such as categories, masks, or short phrases, which restricts their ability to perform complex video segmentation that requires reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning based on world knowledge and video context. Such reasoning is crucial for structured environment understanding and object-centric interaction, both pivotal to the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), which leverages the world-knowledge reasoning capabilities of multi-modal LLMs while segmenting and tracking objects in videos with a mask decoder. Moreover, we establish a comprehensive benchmark consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos, which incorporates complex world-knowledge reasoning into segmentation tasks for instruction tuning and evaluation of ReasonVOS models. Experiments conducted on 8 datasets demonstrate the effectiveness of VISA in tackling complex reasoning segmentation and vanilla referring segmentation in both video and image domains. The code and dataset are available at https://github.com/cilinyan/VISA.

Results

Task	Dataset	Metric	Value	Model
Video	ReVOS	F	52.9	VISA (Chat-UniVi-13B)
Video	ReVOS	J	48.8	VISA (Chat-UniVi-13B)
Video	ReVOS	J&F	50.9	VISA (Chat-UniVi-13B)
Video	ReVOS	R	14.5	VISA (Chat-UniVi-13B)
Video	ReVOS	F	49	VISA (Chat-UniVi-7B)
Video	ReVOS	J	44.9	VISA (Chat-UniVi-7B)
Video	ReVOS	J&F	46.9	VISA (Chat-UniVi-7B)
Video	ReVOS	R	15.5	VISA (Chat-UniVi-7B)
Video Object Segmentation	ReVOS	F	52.9	VISA (Chat-UniVi-13B)
Video Object Segmentation	ReVOS	J	48.8	VISA (Chat-UniVi-13B)
Video Object Segmentation	ReVOS	J&F	50.9	VISA (Chat-UniVi-13B)
Video Object Segmentation	ReVOS	R	14.5	VISA (Chat-UniVi-13B)
Video Object Segmentation	ReVOS	F	49	VISA (Chat-UniVi-7B)
Video Object Segmentation	ReVOS	J	44.9	VISA (Chat-UniVi-7B)
Video Object Segmentation	ReVOS	J&F	46.9	VISA (Chat-UniVi-7B)
Video Object Segmentation	ReVOS	R	15.5	VISA (Chat-UniVi-7B)
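
The listing does not define the metric columns. In video object segmentation benchmarks, J conventionally denotes region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F denotes contour accuracy, and J&F is the mean of the two. Under that assumption, the J&F rows above agree with the average of the J and F rows up to rounding. The sketch below illustrates J on a single frame and the J&F average; the function names are hypothetical and are not taken from the VISA code base, and the F computation (boundary matching) is omitted.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    `pred` and `gt` are equally sized 2D lists of 0/1 values (one frame).
    Returning 1.0 when both masks are empty is an assumed convention.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2


# One-frame example: the prediction covers two pixels, the ground truth one.
pred_mask = [[1, 1],
             [0, 0]]
gt_mask = [[1, 0],
           [0, 0]]
frame_j = jaccard(pred_mask, gt_mask)

# Averaging the reported 13B scores (J = 48.8, F = 52.9) for comparison
# with the reported J&F of 50.9.
mean_13b = j_and_f(48.8, 52.9)
```

In a full evaluation, per-frame scores would be averaged over every frame of a mask sequence and then over all sequences; this sketch covers only the single-frame case.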
