STEm-Seg: Spatio-Temporal Embeddings for Instance Segmentation in Videos

Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames and then associate these detections over time. As a result, these methods are often not end-to-end trainable and are highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as the parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
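The core idea, per the abstract, is that every pixel in the T×H×W clip volume receives an embedding, and pixels of the same instance are clustered together across space and time. The minimal NumPy sketch below illustrates this clustering step under simplifying assumptions: a Gaussian soft assignment around a candidate instance center with a single scalar bandwidth. The function name, the fixed bandwidth, and the toy data are illustrative only; in the paper the embeddings and the clustering parameters are learned jointly by the network, and the actual formulation may differ.

```python
import numpy as np

def cluster_spatiotemporal_embeddings(embeddings, center, bandwidth, threshold=0.5):
    """Assign pixels of a T x H x W clip to one instance by clustering
    their embedding vectors around a candidate instance center.

    embeddings: (T, H, W, D) per-pixel spatio-temporal embeddings
    center:     (D,) embedding of the candidate instance
    bandwidth:  scalar clustering bandwidth (learned in the paper;
                fixed here purely for illustration)
    Returns a boolean (T, H, W) mask for the instance.
    """
    # Squared distance of every pixel embedding to the instance center.
    d2 = np.sum((embeddings - center) ** 2, axis=-1)
    # Gaussian soft assignment; pixels above the threshold join the instance.
    prob = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return prob > threshold

# Toy clip: 2 frames of 4x4 pixels with 2-D embeddings. One instance's
# pixels (the top-left 2x2 patch in both frames) cluster near (1, 1),
# while background embeddings stay near the origin.
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 0.05, size=(2, 4, 4, 2))
emb[:, :2, :2, :] += 1.0
mask = cluster_spatiotemporal_embeddings(emb, center=np.array([1.0, 1.0]),
                                         bandwidth=0.5)
print(mask.sum())  # → 8: the 2*2*2 instance pixels across both frames
```

Because the mask is computed over the whole spatio-temporal volume at once, segmentation and tracking fall out of a single clustering pass, rather than per-frame detection followed by association.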
| Task | Dataset | Model | J&F | Jaccard (Mean) | Jaccard (Recall) | F-measure (Mean) | F-measure (Recall) |
|---|---|---|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | STEm-Seg | 64.7 | 61.5 | 70.4 | 67.8 | 75.5 |
| Task | Dataset | Model | mask AP | AP50 | AP75 | AR1 | AR10 |
|---|---|---|---|---|---|---|---|
| Video Instance Segmentation | YouTube-VIS validation | STEm-Seg (ResNet-101) | 34.6 | 55.8 | 37.9 | 34.4 | 41.6 |
| Video Instance Segmentation | YouTube-VIS validation | STEm-Seg (ResNet-50) | 30.6 | 50.7 | 37.9 | 34.4 | 41.6 |