UniVS: Unified and Universal Video Segmentation with Prompts as Queries

Minghan Li, Shuai Li, Xindong Zhang, Lei Zhang

2024-02-28, CVPR 2024

Tasks: Video Panoptic Segmentation, Referring Video Object Segmentation, Referring Expression Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation, Video Instance Segmentation, Video Object Tracking


Abstract

Despite the recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames, while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video, making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture, namely UniVS, by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing, ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks, covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS.
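
The abstract's "prompts as queries" idea, averaging a target's prompt features from previous frames to form its initial query, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `masked_average` and `initial_query` are hypothetical, and extracting a prompt feature by masked average pooling over the predicted mask is an assumption made only for illustration. The abstract does not specify how prompt features are computed.

```python
from typing import List, Sequence

Vector = List[float]


def masked_average(features: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Average the per-pixel feature vectors selected by a 0/1 mask.

    Assumption: a target's prompt feature for one frame is the mean of the
    pixel features under its predicted mask (masked average pooling).
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[c] for f in selected) / len(selected) for c in range(dim)]


def initial_query(memory_pool: Sequence[Vector]) -> Vector:
    """Average a target's prompt features from previous frames."""
    if not memory_pool:
        raise ValueError("memory pool is empty")
    dim = len(memory_pool[0])
    return [sum(p[c] for p in memory_pool) / len(memory_pool) for c in range(dim)]


# Toy data: two previous frames, three pixels each, two feature channels.
frame1 = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
frame2 = [[2.0, 4.0], [4.0, 4.0], [0.0, 0.0]]
mask1 = [1, 1, 0]  # predicted mask of the target in frame 1
mask2 = [1, 1, 0]  # predicted mask of the target in frame 2

# The memory pool holds one prompt feature per previous frame.
pool = [masked_average(frame1, mask1), masked_average(frame2, mask2)]
query = initial_query(pool)
print(query)
```

In this toy setup the predicted masks from earlier frames act as visual prompts, so the target is carried forward through its own averaged features rather than through a separate inter-frame matching step.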

Results

Task	Dataset	Metric	Value	Model
Video Semantic Segmentation	VSPW	mIoU	59.8	UniVS(Swin-L)
Video Object Segmentation	YouTube-VOS 2018	Mean Jaccard & F-Measure	71.5	UniVS(Swin-L)
Video Object Segmentation	DAVIS 2017 (val)	F-measure	79.5	UniVS(Swin-L)
Video Object Segmentation	DAVIS 2017 (val)	Jaccard	72.8	UniVS(Swin-L)
Video Object Segmentation	DAVIS 2017 (val)	Mean Jaccard & F-Measure	76.2	UniVS(Swin-L)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	F	59.5	UniVS(Swin-L)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J	56.8	UniVS(Swin-L)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	58	UniVS(Swin-L)
Referring Expression Segmentation	DAVIS 2017 (val)	J&F Full video	59.4	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS 2021	AP50	79.4	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS 2021	AP75	63.3	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS 2021	AR1	46.2	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS 2021	AR10	63.1	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS 2021	mask AP	57.9	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS validation	AP50	82.1	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS validation	AP75	65.3	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS validation	AR1	54.7	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS validation	AR10	66.8	UniVS(Swin-L)
Video Instance Segmentation	YouTube-VIS validation	mask AP	60	UniVS(Swin-L)
Video Instance Segmentation	OVIS validation	mask AP	41.7	UniVS(Swin-L)
Panoptic Segmentation	VIPSeg	STQ	58.2	UniVS(Swin-L)
Panoptic Segmentation	VIPSeg	VPQ	49.3	UniVS(Swin-L)
