Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
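
The fusion step described above can be illustrated with a small sketch. This is not the authors' procedure: it assumes a simple greedy IoU matching between masks propagated from earlier frames and masks produced by the image-level model on the current frame, and the names `iou`, `fuse`, and `iou_threshold` are hypothetical. Matched tracks are refreshed with the new image-level mask, unmatched detections start new tracks, and unmatched tracks keep their propagated mask.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def fuse(propagated: dict[int, np.ndarray],
         detections: list[np.ndarray],
         iou_threshold: float = 0.5) -> dict[int, np.ndarray]:
    """Greedily merge propagated masks (keyed by track id) with new detections.

    Illustrative assumption only, not the fusion rule used in the paper.
    """
    fused = dict(propagated)                 # unmatched tracks keep their mask
    next_id = max(propagated, default=0) + 1
    unmatched = set(propagated)
    for det in detections:
        best_id, best_iou = None, iou_threshold
        for track_id in unmatched:
            score = iou(propagated[track_id], det)
            if score > best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            fused[next_id] = det             # no match: start a new track
            next_id += 1
        else:
            fused[best_id] = det             # match: refresh the existing track
            unmatched.discard(best_id)
    return fused


# Toy 4x4 frame: one tracked object, two image-level detections.
prev = np.zeros((4, 4), dtype=bool)
prev[:2, :2] = True
refreshed = prev.copy()
refreshed[0, 2] = True                       # overlaps the tracked object
new_object = np.zeros((4, 4), dtype=bool)
new_object[2:, 2:] = True                    # disjoint from the tracked object

result = fuse({1: prev}, [refreshed, new_object])
print(sorted(result))
```

In this toy case the first detection overlaps track 1 strongly enough to be treated as the same object, while the second is disjoint and receives a fresh id. The propagation model itself, which produces `propagated`, is outside the scope of the sketch.
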
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Object Segmentation | MOSE | F | 70.8 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | J | 62.3 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | J&F | 66.5 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | F | 64.3 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | J | 55.8 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | J&F | 60.0 | DEVA (no OVIS) |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 91.0 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 87.6 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 84.2 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 25.3 | DEVA |
| Video Object Segmentation | YouTube-VOS 2019 | FPS | 25.3 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 86.8 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 25.3 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 83.2 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 79.6 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 62.1 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2016 (val) | F | 90.2 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2016 (val) | G | 88.9 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2016 (val) | J | 87.6 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 76.4 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 73.4 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 70.4 | DEVA (EntitySeg) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 66.0 | DEVA (ReferFormer) |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 66.3 | DEVA (ReferFormer) |
| Semi-Supervised Video Object Segmentation | MOSE | F | 70.8 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J | 62.3 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 66.5 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | F | 64.3 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J | 55.8 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 60.0 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 91.0 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 87.6 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 84.2 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | FPS | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 86.8 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 83.2 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 79.6 | DEVA |
| Video Segmentation | BURST-val | OWTA (all) | 69.9 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (com) | 75.2 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (unc) | 41.5 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (all) | 69.5 | DEVA (EntitySeg) |
| Video Segmentation | BURST-val | OWTA (com) | 73.3 | DEVA (EntitySeg) |
| Video Segmentation | BURST-val | OWTA (unc) | 50.5 | DEVA (EntitySeg) |
| Panoptic Segmentation | VIPSeg | STQ | 52.2 | DEVA (Mask2Former - SwinB) |
| Panoptic Segmentation | VIPSeg | VPQ | 55.0 | DEVA (Mask2Former - SwinB) |
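
For reading the table: in the DAVIS-style benchmarks, J&F is the arithmetic mean of J (region similarity, i.e. Jaccard index) and F (contour accuracy), and the DAVIS 2016 "G" value appears to follow the same convention. The short check below recomputes that mean from the J and F values listed above. The helper name `j_and_f` and the row labels are illustrative. Because the table holds rounded scores, the recomputed mean can differ from the reported one by up to about 0.05.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2


# (J, F, reported J&F or G), copied from the table above.
rows = {
    "MOSE, DEVA (with OVIS)": (62.3, 70.8, 66.5),
    "MOSE, DEVA (no OVIS)": (55.8, 64.3, 60.0),
    "DAVIS 2017 val, DEVA": (84.2, 91.0, 87.6),
    "DAVIS 2017 test-dev, DEVA": (79.6, 86.8, 83.2),
    "DAVIS 2017 val, DEVA (EntitySeg)": (70.4, 76.4, 73.4),
    "DAVIS 2016 val, DEVA (DIS)": (87.6, 90.2, 88.9),
}

# Absolute gap between the recomputed and the reported average, per row.
gaps = {name: abs(j_and_f(j, f) - reported)
        for name, (j, f, reported) in rows.items()}
max_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: gap {gap:.2f}")
```
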