Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Tracking Anything with Decoupled Video Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee

2023-09-07, ICCV 2023

Tasks: Unsupervised Video Object Segmentation, Semi-Supervised Video Object Segmentation, Panoptic Segmentation, Video Panoptic Segmentation, Referring Video Object Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation, Open-World Video Segmentation

Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
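The fusion step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' algorithm: it stands in a simple greedy IoU matching for the paper's bi-directional, semi-online fusion, and every name in it (`iou`, `fuse`, `threshold`) is hypothetical. It only shows the decoupled idea, where masks carried forward by a task-agnostic propagation model are reconciled with fresh masks from a task-specific image-level model.

```python
from typing import Dict, FrozenSet, List, Tuple

Pixel = Tuple[int, int]
Mask = FrozenSet[Pixel]  # a mask is represented as a set of (row, col) pixels


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two masks given as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Dict[int, Mask], detections: List[Mask],
         threshold: float = 0.5) -> Dict[int, Mask]:
    """Greedily match image-level detections to propagated tracks.

    Assumed simplification: a matched track takes the detection's mask,
    an unmatched track keeps its propagated mask, and an unmatched
    detection starts a new track.
    """
    fused = dict(propagated)
    next_id = max(propagated, default=0) + 1
    used = set()
    for det in detections:
        best_id, best_iou = None, threshold
        for track_id, mask in propagated.items():
            if track_id in used:
                continue
            score = iou(mask, det)
            if score >= best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            fused[next_id] = det  # new object discovered by the image model
            next_id += 1
        else:
            fused[best_id] = det  # refresh an existing track
            used.add(best_id)
    return fused


# Track 1 was propagated from earlier frames; the image model now reports
# a refined mask for it plus one previously unseen object.
tracks = {1: frozenset({(0, 0), (0, 1)})}
new_masks = [frozenset({(0, 0), (0, 1), (0, 2)}), frozenset({(5, 5)})]
result = fuse(tracks, new_masks)
print(sorted(result))  # track ids after fusion
```

Because the matching operates only on masks, the image-level model can be swapped for another task without touching the propagation side, which is the property the abstract emphasizes.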

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | MOSE | F | 70.8 | DEVA (with OVIS) |
| Video | MOSE | FPS | 25.3 | DEVA (with OVIS) |
| Video | MOSE | J | 62.3 | DEVA (with OVIS) |
| Video | MOSE | J&F | 66.5 | DEVA (with OVIS) |
| Video | MOSE | F | 64.3 | DEVA (no OVIS) |
| Video | MOSE | FPS | 25.3 | DEVA (no OVIS) |
| Video | MOSE | J | 55.8 | DEVA (no OVIS) |
| Video | MOSE | J&F | 60 | DEVA (no OVIS) |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 91 | DEVA |
| Video | DAVIS 2017 (val) | J&F | 87.6 | DEVA |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 84.2 | DEVA |
| Video | DAVIS 2017 (val) | Speed (FPS) | 25.3 | DEVA |
| Video | YouTube-VOS 2019 | FPS | 25.3 | DEVA |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 86.8 | DEVA |
| Video | DAVIS 2017 (test-dev) | FPS | 25.3 | DEVA |
| Video | DAVIS 2017 (test-dev) | J&F | 83.2 | DEVA |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 79.6 | DEVA |
| Video | DAVIS 2017 (test-dev) | J&F | 62.1 | DEVA (EntitySeg) |
| Video | DAVIS 2016 val | F | 90.2 | DEVA (DIS) |
| Video | DAVIS 2016 val | G | 88.9 | DEVA (DIS) |
| Video | DAVIS 2016 val | J | 87.6 | DEVA (DIS) |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 76.4 | DEVA (EntitySeg) |
| Video | DAVIS 2017 (val) | J&F | 73.4 | DEVA (EntitySeg) |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 70.4 | DEVA (EntitySeg) |
| Semantic Segmentation | VIPSeg | STQ | 52.2 | DEVA (Mask2Former - SwinB) |
| Semantic Segmentation | VIPSeg | VPQ | 55 | DEVA (Mask2Former - SwinB) |
| Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 66 | DEVA (ReferFormer) |
| Instance Segmentation | DAVIS 2017 (val) | J&F 1st frame | 66.3 | DEVA (ReferFormer) |
| Video Object Segmentation | MOSE | F | 70.8 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | J | 62.3 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | J&F | 66.5 | DEVA (with OVIS) |
| Video Object Segmentation | MOSE | F | 64.3 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | J | 55.8 | DEVA (no OVIS) |
| Video Object Segmentation | MOSE | J&F | 60 | DEVA (no OVIS) |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 91 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 87.6 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 84.2 | DEVA |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 25.3 | DEVA |
| Video Object Segmentation | YouTube-VOS 2019 | FPS | 25.3 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 86.8 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 25.3 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 83.2 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 79.6 | DEVA |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 62.1 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2016 val | F | 90.2 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2016 val | G | 88.9 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2016 val | J | 87.6 | DEVA (DIS) |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 76.4 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 73.4 | DEVA (EntitySeg) |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 70.4 | DEVA (EntitySeg) |
| Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 66 | DEVA (ReferFormer) |
| Referring Expression Segmentation | DAVIS 2017 (val) | J&F 1st frame | 66.3 | DEVA (ReferFormer) |
| Semi-Supervised Video Object Segmentation | MOSE | F | 70.8 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J | 62.3 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 66.5 | DEVA (with OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | F | 64.3 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | FPS | 25.3 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J | 55.8 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 60 | DEVA (no OVIS) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 91 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 87.6 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 84.2 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | FPS | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 86.8 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 25.3 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 83.2 | DEVA |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 79.6 | DEVA |
| Video Segmentation | BURST-val | OWTA (all) | 69.9 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (com) | 75.2 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (unc) | 41.5 | DEVA (Mask2Former) |
| Video Segmentation | BURST-val | OWTA (all) | 69.5 | DEVA (EntitySeg) |
| Video Segmentation | BURST-val | OWTA (com) | 73.3 | DEVA (EntitySeg) |
| Video Segmentation | BURST-val | OWTA (unc) | 50.5 | DEVA (EntitySeg) |
| 10-shot image generation | VIPSeg | STQ | 52.2 | DEVA (Mask2Former - SwinB) |
| 10-shot image generation | VIPSeg | VPQ | 55 | DEVA (Mask2Former - SwinB) |
| Panoptic Segmentation | VIPSeg | STQ | 52.2 | DEVA (Mask2Former - SwinB) |
| Panoptic Segmentation | VIPSeg | VPQ | 55 | DEVA (Mask2Former - SwinB) |
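
For readers unfamiliar with the DAVIS-style metrics in the results above: J is region similarity (the Jaccard index, i.e. intersection-over-union of predicted and ground-truth masks), F is contour accuracy, and J&F is the mean of the two. The G value on DAVIS 2016 appears consistent with the same mean. The sketch below is a minimal illustration with hypothetical function names (`jaccard`, `j_and_f`); it represents masks as pixel sets and does not implement the boundary-based F measure.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def jaccard(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Region similarity J: intersection-over-union of two pixel sets."""
    union = len(pred | gt)
    # Assumed convention for this sketch: two empty masks count as a perfect match.
    return len(pred & gt) / union if union else 1.0


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2


# Toy masks: 2 shared pixels out of 4 distinct pixels.
pred = {(0, 0), (0, 1), (1, 0)}
gt = {(0, 0), (0, 1), (1, 1)}
toy_j = jaccard(pred, gt)

# Reproduce J&F from the reported J and F values.
davis17_val = round(j_and_f(84.2, 91.0), 1)      # reported J&F: 87.6
davis17_testdev = round(j_and_f(79.6, 86.8), 1)  # reported J&F: 83.2
davis16_val = round(j_and_f(87.6, 90.2), 1)      # reported G: 88.9
print(toy_j, davis17_val, davis17_testdev, davis16_val)
```

Reported J&F values are typically averaged before rounding, so recomputing from the rounded J and F can differ in the last digit (for example, the MOSE rows with J 62.3 and F 70.8 give 66.55, reported as 66.5).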

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)