Evangelos Skartados, Konstantinos Georgiadis, Mehmet Kerim Yucel, Koskinas Ioannis, Armando Domi, Anastasios Drosou, Bruno Manganelli, Albert Saa-Garriga
Space-time memory (STM) network methods have been dominant in semi-supervised video object segmentation (SVOS) due to their remarkable performance. In this work, we identify three key aspects in which such methods can be improved: i) supervisory signal, ii) pretraining and iii) spatial awareness. We then propose TrickVOS, a generic, method-agnostic bag of tricks that addresses each aspect with i) a structure-aware hybrid loss, ii) a simple decoder pretraining regime and iii) a cheap tracker that imposes spatial constraints on model predictions. Finally, we propose a lightweight network and show that, when trained with TrickVOS, it achieves results competitive with state-of-the-art methods on the DAVIS and YouTube benchmarks, while being one of the first STM-based SVOS methods that can run in real time on a mobile device.
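To make the idea of a spatial constraint concrete, here is a minimal sketch of one way a box from a tracker could restrict a predicted mask. The abstract does not describe the tracker itself, so this sketch swaps in a simple stand-in (the previous frame's mask bounding box, grown by a margin). The function names `mask_to_box`, `expand_box` and `constrain`, the `margin` parameter, and the 0.5 threshold are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Mask = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def mask_to_box(mask: Mask, threshold: float = 0.5) -> Optional[Box]:
    """Tight box around pixels above `threshold`, or None if there are none."""
    rows = [r for r, row in enumerate(mask) if any(p > threshold for p in row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] > threshold for row in mask)]
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def expand_box(box: Box, margin: int, height: int, width: int) -> Box:
    """Grow the box by `margin` pixels per side, clipped to the frame."""
    top, left, bottom, right = box
    return (max(0, top - margin), max(0, left - margin),
            min(height, bottom + margin), min(width, right + margin))


def constrain(prob: Mask, box: Box) -> Mask:
    """Zero out predicted probabilities that fall outside the box."""
    top, left, bottom, right = box
    return [[p if top <= r < bottom and left <= c < right else 0.0
             for c, p in enumerate(row)]
            for r, row in enumerate(prob)]


# Hypothetical 5x5 example: the object occupied a 2x2 patch in the previous frame.
prev_mask = [[0, 0, 0, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0]]
curr_prob = [[0.0] * 5 for _ in range(5)]
curr_prob[2][2] = 0.9  # plausible: close to the previous location
curr_prob[4][4] = 0.9  # spurious: far from the previous location

search_box = expand_box(mask_to_box(prev_mask), margin=1, height=5, width=5)
constrained = constrain(curr_prob, search_box)
```

In this toy case the response at `(2, 2)` survives and the distant one at `(4, 4)` is suppressed. A real tracker would also have to handle fast motion, occlusion and multiple objects, none of which this sketch attempts.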
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 93.1 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 91.8 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 90.5 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 45.4 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 89.9 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 89.3 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.7 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 86.4 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | F-measure (Seen) | 86.4 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | F-measure (Unseen) | 85.5 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | J&F | 82.8 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Seen) | 82.1 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Unseen) | 77.2 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | F-measure (Seen) | 83.3 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | F-measure (Unseen) | 84.0 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | J&F | 80.5 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Seen) | 79.5 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Unseen) | 75.2 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | F-measure (Mean) | 89.6 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | J&F | 86.1 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | Jaccard (Mean) | 82.6 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | Speed (FPS) | 35.1 | STCN + TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | F-measure (Mean) | 86.0 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | J&F | 82.7 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | Jaccard (Mean) | 79.4 | Lightweight TrickVOS (PT) |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 | Speed (FPS) | 76.4 | Lightweight TrickVOS (PT) |
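
The J&F rows above are aggregate scores. On DAVIS, J&F is the mean of the Jaccard and F-measure means; on YouTube-VOS, the overall score averages the four seen and unseen J and F values. The short check below recomputes each J&F entry from its component rows. The helper names `mean` and `mismatches` and the 0.05 rounding tolerance are choices made for this sketch.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


# (dataset, model, component scores, reported J&F), copied from the table.
# DAVIS components: [Jaccard mean, F-measure mean].
# YouTube-VOS components: [J seen, J unseen, F seen, F unseen].
REPORTED = [
    ("DAVIS 2016", "STCN + TrickVOS (PT)", [90.5, 93.1], 91.8),
    ("DAVIS 2016", "Lightweight TrickVOS (PT)", [88.7, 89.9], 89.3),
    ("DAVIS 2017", "STCN + TrickVOS (PT)", [82.6, 89.6], 86.1),
    ("DAVIS 2017", "Lightweight TrickVOS (PT)", [79.4, 86.0], 82.7),
    ("YouTube-VOS 2019", "STCN + TrickVOS (PT)", [82.1, 77.2, 86.4, 85.5], 82.8),
    ("YouTube-VOS 2019", "Lightweight TrickVOS (PT)", [79.5, 75.2, 83.3, 84.0], 80.5),
]


def mismatches(rows, tol=0.05):
    """Return (dataset, model) pairs whose reported J&F differs from the recomputed mean."""
    return [(dataset, model)
            for dataset, model, parts, reported in rows
            if abs(mean(parts) - reported) > tol]


print(mismatches(REPORTED))  # prints []
```

Every reported J&F value matches the mean of its components to within rounding, so the table is internally consistent.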