Tracking Anything with Decoupled Video Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee

Published: 2023-09-07
Venue: ICCV 2023
Tasks: Unsupervised Video Object Segmentation, Semi-Supervised Video Object Segmentation, Panoptic Segmentation, Video Panoptic Segmentation, Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation, Open-World Video Segmentation


Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
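The fusion step described in the abstract, where masks carried forward by temporal propagation are combined with fresh image-level segmentations, can be illustrated with a small sketch. This is not the authors' implementation: the greedy IoU matching, the union rule for matched pairs, the `merge_segments` name, the `iou_threshold` parameter, and the representation of masks as sets of pixel coordinates are all assumptions made for illustration. The image segmenter and the propagation network are left out entirely; only the association logic is shown.

```python
from typing import Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]   # (row, col)
Mask = Set[Pixel]         # hypothetical mask representation: a set of pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge_segments(
    propagated: Dict[int, Mask],
    detections: List[Mask],
    next_id: int,
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[Dict[int, Mask], int]:
    """Greedily associate image-level detections with propagated tracks.

    A detection that overlaps an unmatched track by at least `iou_threshold`
    inherits that track's id; any other detection starts a new track.
    Tracks without a matching detection keep their propagated mask.
    """
    merged: Dict[int, Mask] = {}
    unmatched = dict(propagated)
    for det in detections:
        best_id: Optional[int] = None
        best_score = iou_threshold
        for track_id, mask in unmatched.items():
            score = iou(det, mask)
            if score >= best_score:
                best_id, best_score = track_id, score
        if best_id is not None:
            # Assumption of this sketch: fuse a matched pair by taking the union.
            merged[best_id] = unmatched.pop(best_id) | det
        else:
            merged[next_id] = set(det)
            next_id += 1
    merged.update(unmatched)
    return merged, next_id


if __name__ == "__main__":
    propagated = {1: {(0, 0), (0, 1)}}
    detections = [{(0, 0), (0, 1), (0, 2)}, {(5, 5)}]
    merged, next_id = merge_segments(propagated, detections, next_id=2)
    print(sorted(merged), next_id)
```

In this toy setting the first detection overlaps track 1 and extends it, while the second detection has no counterpart and becomes a new track. Because the association step only consumes masks, the image-level model can be swapped per task without retraining the propagation side, which is the point of the decoupled design.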

Results

Task	Dataset	Metric	Value	Model
Video Object Segmentation	MOSE	F	70.8	DEVA (with OVIS)
Video Object Segmentation	MOSE	FPS	25.3	DEVA (with OVIS)
Video Object Segmentation	MOSE	J	62.3	DEVA (with OVIS)
Video Object Segmentation	MOSE	J&F	66.5	DEVA (with OVIS)
Video Object Segmentation	MOSE	F	64.3	DEVA (no OVIS)
Video Object Segmentation	MOSE	FPS	25.3	DEVA (no OVIS)
Video Object Segmentation	MOSE	J	55.8	DEVA (no OVIS)
Video Object Segmentation	MOSE	J&F	60	DEVA (no OVIS)
Video Object Segmentation	DAVIS 2017 (val)	F-measure (Mean)	91	DEVA
Video Object Segmentation	DAVIS 2017 (val)	J&F	87.6	DEVA
Video Object Segmentation	DAVIS 2017 (val)	Jaccard (Mean)	84.2	DEVA
Video Object Segmentation	DAVIS 2017 (val)	Speed (FPS)	25.3	DEVA
Video Object Segmentation	YouTube-VOS 2019	FPS	25.3	DEVA
Video Object Segmentation	DAVIS 2017 (test-dev)	F-measure (Mean)	86.8	DEVA
Video Object Segmentation	DAVIS 2017 (test-dev)	FPS	25.3	DEVA
Video Object Segmentation	DAVIS 2017 (test-dev)	J&F	83.2	DEVA
Video Object Segmentation	DAVIS 2017 (test-dev)	Jaccard (Mean)	79.6	DEVA
Video Object Segmentation	DAVIS 2017 (test-dev)	J&F	62.1	DEVA (EntitySeg)
Video Object Segmentation	DAVIS 2016 val	F	90.2	DEVA (DIS)
Video Object Segmentation	DAVIS 2016 val	G	88.9	DEVA (DIS)
Video Object Segmentation	DAVIS 2016 val	J	87.6	DEVA (DIS)
Video Object Segmentation	DAVIS 2017 (val)	F-measure (Mean)	76.4	DEVA (EntitySeg)
Video Object Segmentation	DAVIS 2017 (val)	J&F	73.4	DEVA (EntitySeg)
Video Object Segmentation	DAVIS 2017 (val)	Jaccard (Mean)	70.4	DEVA (EntitySeg)
Referring Expression Segmentation	Refer-YouTube-VOS (2021 public validation)	J&F	66	DEVA (ReferFormer)
Referring Expression Segmentation	DAVIS 2017 (val)	J&F 1st frame	66.3	DEVA (ReferFormer)
Semi-Supervised Video Object Segmentation	MOSE	F	70.8	DEVA (with OVIS)
Semi-Supervised Video Object Segmentation	MOSE	FPS	25.3	DEVA (with OVIS)
Semi-Supervised Video Object Segmentation	MOSE	J	62.3	DEVA (with OVIS)
Semi-Supervised Video Object Segmentation	MOSE	J&F	66.5	DEVA (with OVIS)
Semi-Supervised Video Object Segmentation	MOSE	F	64.3	DEVA (no OVIS)
Semi-Supervised Video Object Segmentation	MOSE	FPS	25.3	DEVA (no OVIS)
Semi-Supervised Video Object Segmentation	MOSE	J	55.8	DEVA (no OVIS)
Semi-Supervised Video Object Segmentation	MOSE	J&F	60	DEVA (no OVIS)
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	F-measure (Mean)	91	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	J&F	87.6	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	Jaccard (Mean)	84.2	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	Speed (FPS)	25.3	DEVA
Semi-Supervised Video Object Segmentation	YouTube-VOS 2019	FPS	25.3	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (test-dev)	F-measure (Mean)	86.8	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (test-dev)	FPS	25.3	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (test-dev)	J&F	83.2	DEVA
Semi-Supervised Video Object Segmentation	DAVIS 2017 (test-dev)	Jaccard (Mean)	79.6	DEVA
Video Segmentation	BURST-val	OWTA (all)	69.9	DEVA (Mask2Former)
Video Segmentation	BURST-val	OWTA (com)	75.2	DEVA (Mask2Former)
Video Segmentation	BURST-val	OWTA (unc)	41.5	DEVA (Mask2Former)
Video Segmentation	BURST-val	OWTA (all)	69.5	DEVA (EntitySeg)
Video Segmentation	BURST-val	OWTA (com)	73.3	DEVA (EntitySeg)
Video Segmentation	BURST-val	OWTA (unc)	50.5	DEVA (EntitySeg)
Panoptic Segmentation	VIPSeg	STQ	52.2	DEVA (Mask2Former - SwinB)
Panoptic Segmentation	VIPSeg	VPQ	55	DEVA (Mask2Former - SwinB)
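
For reading the table: in the DAVIS-style evaluation, J is region similarity (Jaccard index), F is boundary accuracy (F-measure), and J&F is their arithmetic mean; G on DAVIS 2016 is assumed here to denote the same mean. The short check below recomputes the mean from the J and F values listed above and compares it with the reported combined score. The 0.1 tolerance is an assumption that allows for the table having been rounded from unrounded scores.

```python
def j_and_f(j: float, f: float) -> float:
    """Arithmetic mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0


# (dataset and model, J, F, reported J&F or G), copied from the table above.
ROWS = [
    ("MOSE, DEVA (with OVIS)", 62.3, 70.8, 66.5),
    ("MOSE, DEVA (no OVIS)", 55.8, 64.3, 60.0),
    ("DAVIS 2017 (val), DEVA", 84.2, 91.0, 87.6),
    ("DAVIS 2017 (test-dev), DEVA", 79.6, 86.8, 83.2),
    ("DAVIS 2016 val, DEVA (DIS)", 87.6, 90.2, 88.9),
    ("DAVIS 2017 (val), DEVA (EntitySeg)", 70.4, 76.4, 73.4),
]

TOLERANCE = 0.1  # assumed slack for rounding in the published numbers


def consistent(rows, tol: float = TOLERANCE) -> bool:
    """True if every reported mean is within `tol` of (J + F) / 2."""
    return all(abs(j_and_f(j, f) - reported) <= tol for _, j, f, reported in rows)


if __name__ == "__main__":
    for name, j, f, reported in ROWS:
        print(name, round(j_and_f(j, f), 2), reported)
    print("all rows consistent:", consistent(ROWS))
```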
