
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UniVS: Unified and Universal Video Segmentation with Prompts as Queries

Minghan Li, Shuai Li, Xindong Zhang, Lei Zhang

Published: 2024-02-28 (CVPR 2024)
Tasks: Video Panoptic Segmentation, Referring Video Object Segmentation, Referring Expression Segmentation, Video Segmentation, Video Object Segmentation, Video Semantic Segmentation, Video Instance Segmentation, Video Object Tracking

Abstract

Despite the recent advances in unified image segmentation (IS), developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames, while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video, making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture, namely UniVS, by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks, and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. By taking the predicted masks of entities from previous frames as their visual prompts, UniVS converts different VS tasks into prompt-guided target segmentation, eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing, ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks, covering video instance, semantic, panoptic, object, and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS.
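The abstract says UniVS averages a target's prompt features from previous frames to form its initial query. The following is a minimal sketch of that averaging step only, written in plain Python so it runs without dependencies. It is not taken from the UniVS repository: the names `average_prompt_features`, `initial_queries`, and `memory_pool`, and the representation of features as lists of floats, are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def average_prompt_features(features: List[Vector]) -> Vector:
    """Element-wise mean of one target's prompt features across frames."""
    if not features:
        raise ValueError("need at least one prompt feature")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all prompt features must share one dimension")
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def initial_queries(memory_pool: Dict[int, List[Vector]]) -> Dict[int, Vector]:
    """Build one initial query per target from a hypothetical memory pool.

    memory_pool maps a target id to the prompt features collected for that
    target in previous frames (an assumed layout, not the authors' format).
    """
    return {tid: average_prompt_features(feats) for tid, feats in memory_pool.items()}


# Toy example: target 0 was seen in two earlier frames, target 1 in one.
memory_pool = {
    0: [[1.0, 2.0], [3.0, 4.0]],
    1: [[0.5, 0.5]],
}
queries = initial_queries(memory_pool)
```

In this sketch each target keeps its own query, so no matching between frames is needed to decide which query belongs to which target. The actual model additionally uses a target-wise prompt cross-attention layer, which is not shown here.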

Results

Task | Dataset | Metric | Value | Model
Video | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 71.5 | UniVS (Swin-L)
Video | DAVIS 2017 (val) | F-measure | 79.5 | UniVS (Swin-L)
Video | DAVIS 2017 (val) | Jaccard | 72.8 | UniVS (Swin-L)
Video | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 76.2 | UniVS (Swin-L)
Scene Parsing | VSPW | mIoU | 59.8 | UniVS (Swin-L)
Semantic Segmentation | VIPSeg | STQ | 58.2 | UniVS (Swin-L)
Semantic Segmentation | VIPSeg | VPQ | 49.3 | UniVS (Swin-L)
Video Semantic Segmentation | VSPW | mIoU | 59.8 | UniVS (Swin-L)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 59.5 | UniVS (Swin-L)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 56.8 | UniVS (Swin-L)
Instance Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 58 | UniVS (Swin-L)
Instance Segmentation | DAVIS 2017 (val) | J&F Full video | 59.4 | UniVS (Swin-L)
Video Object Segmentation | YouTube-VOS 2018 | Mean Jaccard & F-Measure | 71.5 | UniVS (Swin-L)
Video Object Segmentation | DAVIS 2017 (val) | F-measure | 79.5 | UniVS (Swin-L)
Video Object Segmentation | DAVIS 2017 (val) | Jaccard | 72.8 | UniVS (Swin-L)
Video Object Segmentation | DAVIS 2017 (val) | Mean Jaccard & F-Measure | 76.2 | UniVS (Swin-L)
Scene Understanding | VSPW | mIoU | 59.8 | UniVS (Swin-L)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | F | 59.5 | UniVS (Swin-L)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J | 56.8 | UniVS (Swin-L)
Referring Expression Segmentation | Refer-YouTube-VOS (2021 public validation) | J&F | 58 | UniVS (Swin-L)
Referring Expression Segmentation | DAVIS 2017 (val) | J&F Full video | 59.4 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 79.4 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 63.3 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 46.2 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 63.1 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 57.9 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AP50 | 82.1 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AP75 | 65.3 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AR1 | 54.7 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AR10 | 66.8 | UniVS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | mask AP | 60 | UniVS (Swin-L)
Video Instance Segmentation | OVIS validation | mask AP | 41.7 | UniVS (Swin-L)
2D Semantic Segmentation | VSPW | mIoU | 59.8 | UniVS (Swin-L)
10-shot image generation | VIPSeg | STQ | 58.2 | UniVS (Swin-L)
10-shot image generation | VIPSeg | VPQ | 49.3 | UniVS (Swin-L)
Panoptic Segmentation | VIPSeg | STQ | 58.2 | UniVS (Swin-L)
Panoptic Segmentation | VIPSeg | VPQ | 49.3 | UniVS (Swin-L)
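
The "Mean Jaccard & F-Measure" (J&F) entries are conventionally the arithmetic mean of the Jaccard (J) and F-measure (F) scores. As a quick sanity check, the sketch below recomputes it from the DAVIS 2017 (val) rows listed above. The helper name `mean_j_and_f` is an assumption for illustration. Because the listed J and F values are already rounded to one decimal, the recomputed mean can differ slightly from the reported figure.

```python
def mean_j_and_f(jaccard: float, f_measure: float) -> float:
    """Arithmetic mean of region similarity (J) and contour accuracy (F)."""
    return (jaccard + f_measure) / 2.0


# DAVIS 2017 (val) values from the results listed above.
davis_j = 72.8
davis_f = 79.5
davis_jf = mean_j_and_f(davis_j, davis_f)

# The page reports 76.2; the recomputed mean agrees to within rounding.
print(f"DAVIS 2017 (val) J&F recomputed: {davis_jf:.2f}")
```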

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Memory-Augmented SAM2 for Training-Free Surgical Video Segmentation (2025-07-13)
MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation (2025-07-10)
HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking (2025-07-10)
Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation (2025-07-08)
DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy (2025-07-02)
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval (2025-06-28)
Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)