Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DVIS: Decoupled Video Instance Segmentation Framework

Tao Zhang, Xingye Tian, Yu Wu, Shunping Ji, Xuebo Wang, Yuan Zhang, Pengfei Wan

2023-06-06 · ICCV 2023

Tasks: Video Editing, Video Panoptic Segmentation, Segmentation, Autonomous Driving, Semantic Segmentation, Instance Segmentation, Video Instance Segmentation

Paper · PDF · Code (official)

Abstract

Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex, long real-world videos, primarily for two reasons. First, offline methods are limited by a tightly-coupled modeling paradigm that treats all frames equally and disregards the interdependencies between adjacent frames, introducing excessive noise during long-term temporal alignment. Second, online methods make inadequate use of temporal information. To tackle these challenges, we propose a decoupling strategy for VIS that divides it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment via frame-by-frame association during tracking, and 2) effectively exploiting temporal information, predicated on those accurate alignment results, during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 1.69% of the segmenter's FLOPs), allowing for efficient training and inference on a single GPU with 11 GB of memory. The code is available at https://github.com/zhang-tao-whu/DVIS.
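The decoupling strategy in the abstract — per-frame segmentation, frame-by-frame association during tracking, then temporal refinement over the aligned tracks — can be sketched as follows. This is a toy illustration of the three-stage pipeline only; every function, data shape, and the nearest-embedding matching rule here are hypothetical placeholders, not the DVIS implementation.

```python
# Toy sketch of a decoupled VIS pipeline: segment -> track -> refine.
# Instances are represented as single float "embeddings" for brevity.

def segment(frame):
    # Stand-in for a per-frame segmenter: returns one instance
    # embedding per detected object in this frame.
    return sorted(frame)

def track(per_frame_instances):
    # Frame-by-frame association: greedily match each frame's instances
    # to the previous frame's by nearest embedding, building aligned
    # tracks (one list of embeddings per object, ordered by time).
    tracks = [[inst] for inst in per_frame_instances[0]]
    for frame_insts in per_frame_instances[1:]:
        remaining = list(frame_insts)
        for tr in tracks:
            best = min(remaining, key=lambda x: abs(x - tr[-1]))
            tr.append(best)
            remaining.remove(best)
    return tracks

def refine(tracks):
    # Temporal refinement over the aligned tracks: here, simply smooth
    # each track by averaging its embeddings over time.
    return [sum(tr) / len(tr) for tr in tracks]

frames = [[0.9, 5.1], [1.1, 4.8], [1.0, 5.0]]  # toy "video", 2 objects
per_frame = [segment(f) for f in frames]        # stage 1: segmentation
tracks = track(per_frame)                       # stage 2: tracking
smoothed = refine(tracks)                       # stage 3: refinement
```

The point of the decoupling is visible even in this sketch: refinement only sees tracks that are already temporally aligned, so its temporal aggregation cannot mix up different objects — which is why the paper argues accurate frame-by-frame association must come first.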

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Panoptic Segmentation | VIPSeg | STQ | 55.3 | DVIS (Swin-L) |
| Video Panoptic Segmentation | VIPSeg | VPQ | 57.6 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 83 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 68.4 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 47.7 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 65.7 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 60.1 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS validation | AP50 | 88 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AP75 | 72.7 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AR1 | 56.5 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | AR10 | 70.3 | DVIS |
| Video Instance Segmentation | YouTube-VIS validation | mask AP | 64.9 | DVIS |
| Video Instance Segmentation | OVIS validation | AP50 | 75.9 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP75 | 53 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AR1 | 19.4 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AR10 | 55.3 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | mask AP | 49.9 | DVIS (Swin-L, Offline) |
| Video Instance Segmentation | OVIS validation | AP50 | 71.9 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AP75 | 49.2 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AR1 | 19.4 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | AR10 | 52.5 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | OVIS validation | mask AP | 47.1 | DVIS (Swin-L, Online) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AP50_L | 69 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AP75_L | 48.8 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AR10_L | 51.8 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | AR1_L | 37.2 | DVIS (Swin-L) |
| Video Instance Segmentation | YouTube-VIS 2022 validation | mAP_L | 45.9 | DVIS (Swin-L) |

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
- AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
- Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
- Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)