
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


VISA: Reasoning Video Object Segmentation via Large Language Models

Cilin Yan, Haochen Wang, Shilin Yan, XiaoLong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, Efstratios Gavves

Date: 2024-07-16
Tasks: Referring Video Object Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, World Knowledge, Video Semantic Segmentation

Abstract

Existing Video Object Segmentation (VOS) methods rely on explicit user instructions, such as categories, masks, or short phrases, which restricts their ability to perform complex video segmentation that requires reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts. Such a capability is crucial for structured environment understanding and object-centric interactions, both pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), which leverages the world knowledge reasoning capabilities of multi-modal LLMs while retaining the ability to segment and track objects in videos with a mask decoder. Moreover, we establish a comprehensive benchmark consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos, which incorporates complex world knowledge reasoning into segmentation tasks for instruction tuning and evaluation of ReasonVOS models. Experiments conducted on 8 datasets demonstrate the effectiveness of VISA in tackling complex reasoning segmentation and vanilla referring segmentation in both video and image domains. The code and dataset are available at https://github.com/cilinyan/VISA.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | ReVOS | F | 52.9 | VISA (Chat-UniVi-13B) |
| Video | ReVOS | J | 48.8 | VISA (Chat-UniVi-13B) |
| Video | ReVOS | J&F | 50.9 | VISA (Chat-UniVi-13B) |
| Video | ReVOS | R | 14.5 | VISA (Chat-UniVi-13B) |
| Video | ReVOS | F | 49 | VISA (Chat-UniVi-7B) |
| Video | ReVOS | J | 44.9 | VISA (Chat-UniVi-7B) |
| Video | ReVOS | J&F | 46.9 | VISA (Chat-UniVi-7B) |
| Video | ReVOS | R | 15.5 | VISA (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | F | 52.9 | VISA (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | J | 48.8 | VISA (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | J&F | 50.9 | VISA (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | R | 14.5 | VISA (Chat-UniVi-13B) |
| Video Object Segmentation | ReVOS | F | 49 | VISA (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | J | 44.9 | VISA (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | J&F | 46.9 | VISA (Chat-UniVi-7B) |
| Video Object Segmentation | ReVOS | R | 15.5 | VISA (Chat-UniVi-7B) |
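
The page does not define the metrics. In the common video object segmentation convention, J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. Assuming that convention applies here, the short sketch below checks that the reported J&F values are consistent with the reported J and F values. The function names and the 0.1 tolerance (to allow for rounding of the published figures) are illustrative choices, not part of the source.

```python
# Sanity check of the reported ReVOS numbers, ASSUMING J&F = (J + F) / 2.
# The page itself does not state this definition; it is the usual convention.

def j_and_f(j: float, f: float) -> float:
    """Arithmetic mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


# Values copied from the results listed above.
reported = {
    "VISA (Chat-UniVi-13B)": {"J": 48.8, "F": 52.9, "J&F": 50.9},
    "VISA (Chat-UniVi-7B)": {"J": 44.9, "F": 49.0, "J&F": 46.9},
}


def check_consistency(rows: dict, tol: float = 0.1) -> dict:
    """Return, per model, whether the reported J&F is within `tol` of the mean.

    `tol` is a hypothetical tolerance covering rounding to one decimal place.
    """
    return {
        model: abs(j_and_f(r["J"], r["F"]) - r["J&F"]) <= tol
        for model, r in rows.items()
    }


if __name__ == "__main__":
    for model, ok in check_consistency(reported).items():
        print(f"{model}: consistent={ok}")
```

Both rows agree with the mean to within rounding, which suggests the listed J&F values were computed from unrounded J and F scores.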

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)