Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


CTVIS: Consistent Training for Online Video Instance Segmentation

Kaining Ying, Qing Zhong, Weian Mao, Zhenhua Wang, Hao Chen, Lin Yuanbo Wu, Yifan Liu, Chengxiang Fan, Yunzhi Zhuge, Chunhua Shen

2023-07-24 · ICCV 2023 · Semantic Segmentation · Instance Segmentation · Video Instance Segmentation
Paper · PDF · Code (official)

Abstract

The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by a contrastive loss computed over contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from only one reference frame, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is to replicate the inference phase during training. To this end, we propose a simple yet effective training strategy, Consistent Training for Online VIS (CTVIS), which aligns the training and inference pipelines in terms of how CIs are built. Specifically, CTVIS constructs CIs by adopting the momentum-averaged embeddings and the memory-bank storage mechanism used at inference, and by adding noise to the relevant embeddings. This extension allows a reliable comparison between the embeddings of current instances and the stable representations of historical instances, conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outperforms state-of-the-art VIS models by up to +5.0 points on three VIS benchmarks: YTVIS19 (55.1% AP), YTVIS21 (50.1% AP), and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from still images can train robust models that surpass fully supervised ones.
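To make the abstract's core idea concrete, the following is a minimal sketch of a momentum-averaged memory bank and an InfoNCE-style contrastive loss over one contrastive item (anchor/positive/negative embeddings). This is an illustrative reconstruction, not CTVIS's actual implementation: the class name `InstanceMemoryBank`, the `momentum` value, and the loss formulation are assumptions based only on the description above.

```python
import torch
import torch.nn.functional as F


class InstanceMemoryBank:
    """Stores one momentum-averaged (EMA) embedding per tracked instance.

    Hypothetical sketch: the stable "historical" representation per instance
    described in the abstract, updated as new frames arrive.
    """

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.bank: dict[int, torch.Tensor] = {}

    def update(self, instance_id: int, embedding: torch.Tensor) -> None:
        # Exponential moving average keeps the stored embedding stable
        # against per-frame noise, occlusion, and deformation.
        if instance_id in self.bank:
            self.bank[instance_id] = (
                self.momentum * self.bank[instance_id]
                + (1.0 - self.momentum) * embedding
            )
        else:
            self.bank[instance_id] = embedding.clone()


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one contrastive item.

    anchor:    (D,)   current-frame embedding of an instance
    positive:  (D,)   memory-bank embedding of the same instance
    negatives: (N, D) memory-bank embeddings of other instances
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)  # (1,)
    neg_sim = negatives @ anchor                         # (N,)
    logits = torch.cat([pos_sim, neg_sim]) / temperature
    # The positive sits at index 0 of the logits.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```

Sourcing the positive and negatives from the memory bank (rather than a single reference frame) is what aligns training with the inference-time association step the paper describes.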

Results

Task: Video Instance Segmentation

Dataset                      Metric    Value   Model
OVIS validation              AP50      71.5    CTVIS (Swin-L)
OVIS validation              AP75      47.5    CTVIS (Swin-L)
OVIS validation              APho      19.1    CTVIS (Swin-L)
OVIS validation              APmo      52.1    CTVIS (Swin-L)
OVIS validation              mask AP   46.9    CTVIS (Swin-L)
OVIS validation              AP50      60.8    CTVIS (ResNet-50)
OVIS validation              AP75      34.9    CTVIS (ResNet-50)
OVIS validation              APho      16.1    CTVIS (ResNet-50)
OVIS validation              APmo      41.9    CTVIS (ResNet-50)
OVIS validation              mask AP   35.5    CTVIS (ResNet-50)
YouTube-VIS 2022 Validation  mAP_L     46.4    CTVIS (Swin-L)
YouTube-VIS 2022 Validation  mAP_L     39.4    CTVIS (ResNet-50)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation (2025-07-16)
Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping (2025-07-15)
U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV (2025-07-15)