Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TarViS: A Unified Approach for Target-based Video Segmentation

Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian Leibe

2023-01-06 · CVPR 2023

Tasks: Panoptic Segmentation, Video Panoptic Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Object Segmentation, Instance Segmentation, Video Semantic Segmentation, Video Instance Segmentation

Paper · PDF · Code (official)

Abstract

The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS
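The core idea of the abstract, modeling a task's "targets" as abstract queries that are decoded against shared video features into per-target masks, can be illustrated with a minimal sketch. All names, dimensions, and the dot-product decoder below are hypothetical simplifications for illustration; they are not the paper's code (see the linked repository for the actual implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16             # query/feature dimension (illustrative)
T, H, W = 2, 8, 8  # tiny video: T frames of H x W feature cells

# Shared per-pixel video features, as a task-agnostic backbone might produce.
pixel_feats = rng.normal(size=(T, H, W, D))

def queries_for_task(task: str) -> np.ndarray:
    """Each task encodes its targets as a set of abstract query vectors.

    Hypothetical illustration: in the paper, VIS derives queries from
    class-guided object hypotheses, while VOS derives them from the
    first-frame reference masks.
    """
    if task == "vis":   # e.g. five object hypotheses
        return rng.normal(size=(5, D))
    if task == "vos":   # e.g. two annotated first-frame objects
        return rng.normal(size=(2, D))
    raise ValueError(f"unknown task: {task}")

def predict_masks(queries: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """Task-agnostic decoding: query/pixel-feature similarity gives
    per-target mask logits; thresholding yields binary masks."""
    logits = np.einsum("qd,thwd->qthw", queries, feats)
    return logits > 0

# "Hot-swapping" tasks only changes the query set, not the decoder.
masks = predict_masks(queries_for_task("vis"), pixel_feats)
print(masks.shape)  # (5, 2, 8, 8): one spatio-temporal mask per query
```

The point of the sketch is that the decoding step never inspects which task produced the queries, which is what lets a single jointly trained model switch tasks at inference without retraining.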

Results

Semi-Supervised Video Object Segmentation on DAVIS 2017 (val)

| Model  | J&F  | Jaccard (Mean) | F-measure (Mean) |
|--------|------|----------------|------------------|
| TarViS | 85.3 | 81.7           | 88.5             |

Video Panoptic Segmentation on Cityscapes-VPS

| Model              | VPQ  | VPQ (stuff) | VPQ (thing) |
|--------------------|------|-------------|-------------|
| TarViS (Swin-L)    | 58.9 | 69.9        | 43.7        |
| TarViS (Swin-T)    | 58.0 | 69.0        | 42.9        |
| TarViS (ResNet-50) | 53.3 | 66.0        | 35.9        |

Video Panoptic Segmentation on VIPSeg

| Model              | VPQ  | STQ  |
|--------------------|------|------|
| TarViS (Swin-L)    | 48.0 | 52.9 |
| TarViS (Swin-T)    | 35.8 | 45.3 |
| TarViS (ResNet-50) | 33.5 | 43.1 |

Video Panoptic Segmentation on KITTI-STEP

| Model              | STQ  | AQ   | SQ   |
|--------------------|------|------|------|
| TarViS (Swin-L)    | 73.0 | 72.0 | 72.0 |
| TarViS (Swin-T)    | 70.6 | 71.2 | 69.9 |
| TarViS (ResNet-50) | 69.6 | 70.3 | 68.8 |

Video Instance Segmentation on YouTube-VIS 2021

| Model              | mask AP | AP50 | AP75 | AR1  | AR10 |
|--------------------|---------|------|------|------|------|
| TarViS (Swin-L)    | 60.2    | 81.4 | 67.6 | 47.6 | 64.8 |
| TarViS (Swin-T)    | 50.9    | 71.6 | 56.6 | 42.2 | 57.2 |
| TarViS (ResNet-50) | 48.3    | 69.6 | 53.2 | 40.5 | 55.9 |

Video Instance Segmentation on OVIS (validation)

| Model              | mask AP | AP50 | AP75 | AR1  | AR10 |
|--------------------|---------|------|------|------|------|
| TarViS (Swin-L)    | 43.2    | 67.8 | 44.6 | 18.0 | 50.4 |
| TarViS (Swin-T)    | 34.0    | 55.0 | 34.4 | 16.1 | 40.9 |
| TarViS (ResNet-50) | 31.1    | 52.5 | 30.4 | 15.9 | 39.9 |
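For readers unfamiliar with the DAVIS metrics above: J&F is the average of region similarity J (the Jaccard index, i.e. intersection-over-union of predicted and ground-truth masks) and boundary accuracy F (an F-measure over matched contour points). A minimal sketch of the region component, on toy binary masks:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks.
    An empty union (both masks empty) is scored as a perfect match."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 1.0

# Toy example: a 2x2 prediction against a 3x3 ground-truth region.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True  # 4 pixels
gt = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True      # 9 pixels
print(round(jaccard(pred, gt), 3))  # 4 / 9 -> 0.444
```

The boundary F-measure additionally requires contour extraction and bipartite matching of boundary pixels, so it is omitted here; the official DAVIS toolkit computes both and averages them per object before reporting the J&F mean.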

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)