Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

De-An Huang, Zhiding Yu, Anima Anandkumar

2022-08-03 · Segmentation · Semantic Segmentation · Instance Segmentation · Video Instance Segmentation

Paper · PDF · Code (official) · Code

Abstract

We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS
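The inference pipeline in the abstract tracks instances by bipartite matching of per-frame query embeddings, with no hand-designed heuristics. A minimal sketch of that matching step is shown below, assuming cosine similarity between query embeddings and the Hungarian algorithm for the assignment; the function name and the choice of similarity are illustrative, not the paper's exact implementation (see the official code at the linked repository for the real details).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_queries(prev_queries: np.ndarray, curr_queries: np.ndarray) -> np.ndarray:
    """Match instance queries between two consecutive frames.

    prev_queries, curr_queries: (N, D) arrays of per-frame query embeddings
    produced by a query-based image instance segmentation model.
    Returns an array `assign` of length N where assign[i] is the index of
    the query in the current frame matched to query i of the previous frame.
    """
    # Normalize rows so the dot product equals cosine similarity.
    p = prev_queries / np.linalg.norm(prev_queries, axis=1, keepdims=True)
    c = curr_queries / np.linalg.norm(curr_queries, axis=1, keepdims=True)
    sim = p @ c.T  # (N, N) pairwise similarity matrix

    # Hungarian algorithm minimizes cost, so negate to maximize similarity.
    row, col = linear_sum_assignment(-sim)
    assign = np.empty(len(prev_queries), dtype=int)
    assign[row] = col
    return assign


# Toy usage: the current frame's queries are a permutation of the previous
# frame's, and matching recovers that permutation.
prev = np.eye(3)
curr = prev[[2, 0, 1]]
print(match_queries(prev, curr))  # [1 2 0]
```

Because each frame is matched only against the previous one, this runs online and never needs the whole video in memory, which is the practical advantage the abstract highlights.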

Results

Task | Dataset | Metric | Value | Model
Video Instance Segmentation | YouTube-VIS 2021 | AP50 | 76.6 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AP75 | 62.0 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AR1 | 45.9 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | AR10 | 60.8 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS 2021 | mask AP | 55.3 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AP50 | 83.3 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AP75 | 68.6 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AR1 | 54.8 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | AR10 | 66.6 | MinVIS (Swin-L)
Video Instance Segmentation | YouTube-VIS validation | mask AP | 61.6 | MinVIS (Swin-L)
Video Instance Segmentation | OVIS validation | AP50 | 61.5 | MinVIS (Swin-L)
Video Instance Segmentation | OVIS validation | AP75 | 41.3 | MinVIS (Swin-L)
Video Instance Segmentation | OVIS validation | AR1 | 18.1 | MinVIS (Swin-L)
Video Instance Segmentation | OVIS validation | AR10 | 43.3 | MinVIS (Swin-L)
Video Instance Segmentation | OVIS validation | mask AP | 39.4 | MinVIS (Swin-L)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
From Variability To Accuracy: Conditional Bernoulli Diffusion Models with Consensus-Driven Correction for Thin Structure Segmentation (2025-07-17)
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)