Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H. S. Torr

In this paper we illustrate how to perform both visual object tracking and semi-supervised video object segmentation, in real-time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask solely relies on a single bounding box initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at 55 frames per second. Despite its simplicity, versatility and fast speed, our strategy allows us to establish a new state of the art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is http://www.robots.ox.ac.uk/~qwang/SiamMask.
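
The abstract says the tracking loss is augmented with a binary segmentation task. The snippet below is a minimal sketch of what such an augmented objective could look like, not the authors' implementation. It assumes a per-pixel binary logistic loss with labels in {-1, +1} and a plain weighted sum; the names `logistic_loss`, `mask_loss`, `total_loss` and the parameter `mask_weight` are hypothetical and do not come from the abstract.

```python
import math


def logistic_loss(score: float, label: int) -> float:
    """Binary logistic loss log(1 + exp(-label * score)), label in {-1, +1}."""
    z = -label * score
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mask_loss(pred_logits, target) -> float:
    """Mean per-pixel logistic loss over one predicted mask (2-D nested lists)."""
    losses = [
        logistic_loss(score, label)
        for score_row, label_row in zip(pred_logits, target)
        for score, label in zip(score_row, label_row)
    ]
    return sum(losses) / len(losses)


def total_loss(tracking_loss: float, pred_logits, target,
               mask_weight: float = 1.0) -> float:
    """Assumed form: existing tracking loss plus a weighted segmentation term."""
    return tracking_loss + mask_weight * mask_loss(pred_logits, target)


if __name__ == "__main__":
    # Toy 2x2 mask: logits from a hypothetical mask head, labels +1 (object) / -1 (background).
    logits = [[2.0, -1.5], [0.0, 3.0]]
    labels = [[1, -1], [1, 1]]
    print(total_loss(tracking_loss=0.4, pred_logits=logits, target=labels, mask_weight=2.0))
```

In this sketch the segmentation term only changes the training objective; at inference time the tracker would still be initialised from a single bounding box, as the abstract describes.
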
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Tracking | NT-VOT211 | AUC | 35.14 | SiamMask |
| Object Tracking | NT-VOT211 | Precision | 46.49 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 20.9 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 58.5 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 67.5 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 56.4 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 19.3 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 54.3 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 62.8 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 2.1 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 67.8 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 79.8 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 69.75 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 3 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 71.7 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 86.8 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Decay) | 22.4 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 45.8 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Recall) | 45.3 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 43.2 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Decay) | 21.9 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 40.6 | SiamMask |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Recall) | 44.5 | SiamMask |
| Visual Object Tracking | VOT2017/18 | Expected Average Overlap (EAO) | 0.38 | SiamMask |
| Visual Object Tracking | YouTube-VOS 2018 | F-Measure (Seen) | 58.2 | SiamMask |
| Visual Object Tracking | YouTube-VOS 2018 | F-Measure (Unseen) | 47.7 | SiamMask |
| Visual Object Tracking | YouTube-VOS 2018 | Jaccard (Seen) | 60.2 | SiamMask |
| Visual Object Tracking | YouTube-VOS 2018 | Jaccard (Unseen) | 45.1 | SiamMask |
| Visual Object Tracking | YouTube-VOS 2018 | O (Average of Measures) | 52.8 | SiamMask |
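
The J&F rows for DAVIS appear to be the arithmetic mean of the Jaccard (Mean) and F-measure (Mean) rows for the same dataset. The short check below recomputes them from the values in the table; the dictionary layout and the helper name `mean` are illustrative choices only.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# (Jaccard mean, F-measure mean) pairs copied from the table above.
davis_scores = {
    "DAVIS 2016": (71.7, 67.8),
    "DAVIS 2017 (val)": (54.3, 58.5),
    "DAVIS 2017 (test-dev)": (40.6, 45.8),
}

# J&F is taken here to be the mean of the two per-dataset averages.
j_and_f = {name: round(mean(pair), 2) for name, pair in davis_scores.items()}

if __name__ == "__main__":
    for name, value in j_and_f.items():
        print(f"{name}: J&F = {value}")
```

The recomputed values (69.75, 56.4 and 43.2) agree with the J&F rows reported for the three DAVIS splits.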