Paul Voigtlaender, Jonathon Luiten, Philip H. S. Torr, Bastian Leibe
We present Siam R-CNN, a Siamese re-detection architecture that unleashes the full power of two-stage object detection approaches for visual object tracking. We combine this with a novel tracklet-based dynamic programming algorithm, which takes advantage of re-detections of both the first-frame template and previous-frame predictions to model the full history of both the object to be tracked and potential distractor objects. This enables our approach to make better tracking decisions, as well as to re-detect tracked objects after long occlusion. Finally, we propose a novel hard example mining strategy to improve Siam R-CNN's robustness to similar-looking objects. Siam R-CNN achieves the current best performance on ten tracking benchmarks, with especially strong results for long-term tracking. We make our code and models available at www.vision.rwth-aachen.de/page/siamrcnn.
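
To make the idea of choosing a track by dynamic programming over re-detections more concrete, the following is a minimal sketch under strong simplifying assumptions. It is not the paper's tracklet-based algorithm: it works offline on individual per-frame detections, assumes every frame has at least one candidate, and uses box IoU as a stand-in for previous-frame consistency. The names `iou`, `best_path`, `template_scores`, and `w_temporal` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_path(
    boxes: List[List[Box]],
    template_scores: List[List[float]],
    w_temporal: float = 1.0,
) -> List[int]:
    """Viterbi-style DP (illustrative assumption, not the authors' method).

    Picks one candidate index per frame so that the sum of first-frame
    template similarities plus w_temporal * IoU between consecutive picks
    is maximal. Every frame must contain at least one candidate.
    """
    score = [list(template_scores[0])]   # best accumulated score per candidate
    back = [[-1] * len(boxes[0])]        # back-pointers to the previous frame
    for t in range(1, len(boxes)):
        cur_score, cur_back = [], []
        for j, box in enumerate(boxes[t]):
            # Score of reaching candidate j from each candidate i in frame t-1.
            trans = [
                score[t - 1][i] + w_temporal * iou(prev, box)
                for i, prev in enumerate(boxes[t - 1])
            ]
            i_best = max(range(len(trans)), key=trans.__getitem__)
            cur_score.append(trans[i_best] + template_scores[t][j])
            cur_back.append(i_best)
        score.append(cur_score)
        back.append(cur_back)

    # Backtrack from the best-scoring candidate in the last frame.
    j = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [j]
    for t in range(len(boxes) - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return path[::-1]


# Toy example: candidate 0 is the target, candidate 1 a static distractor.
# In frame 1 the distractor has the higher template score (0.6 vs 0.5).
boxes = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
template_scores = [[0.9, 0.4], [0.5, 0.6], [0.8, 0.5]]
print(best_path(boxes, template_scores))  # [0, 0, 0]
```

In the toy example, a per-frame argmax over template scores would jump to the distractor in the second frame, whereas the path-level optimum stays on the target because it also rewards consistency with the previous-frame prediction.
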
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Tracking | COESOT | Precision Rate | 67.5 | Siam R-CNN |
| Object Tracking | COESOT | Success Rate | 60.9 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Decay) | 16.2 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 75 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Recall) | 82.8 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 70.55 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Decay) | 15.8 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 66.1 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Recall) | 74.8 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Decay) | 4 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 80.4 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Recall) | 87.6 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 78.6 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Decay) | 2.2 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 76.8 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Recall) | 86.4 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Decay) | 20.2 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 58.6 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Recall) | 62.3 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 53.3 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Decay) | 21.8 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 48 | Siam R-CNN |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Recall) | 53.9 | Siam R-CNN |
| Visual Object Tracking | LaSOT | AUC | 64.8 | Siam R-CNN |
| Visual Object Tracking | LaSOT | Normalized Precision | 72.2 | Siam R-CNN |
| Visual Object Tracking | GOT-10k | Average Overlap | 64.9 | Siam R-CNN |
| Visual Object Tracking | GOT-10k | Success Rate 0.5 | 72.8 | Siam R-CNN |
| Visual Object Tracking | TrackingNet | Accuracy | 81.2 | Siam R-CNN |
| Visual Object Tracking | TrackingNet | Normalized Precision | 85.4 | Siam R-CNN |
| Visual Object Tracking | TrackingNet | Precision | 80 | Siam R-CNN |
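
The J&F rows in the table can be cross-checked against the Jaccard (Mean) and F-measure (Mean) rows, since the DAVIS J&F score is the average of the two means. The short sketch below redoes that arithmetic for the three DAVIS splits listed above; the helper name `j_and_f` is hypothetical, and the numbers are copied from the table.

```python
def j_and_f(j_mean: float, f_mean: float) -> float:
    """DAVIS J&F: average of the mean Jaccard and mean F-measure."""
    return (j_mean + f_mean) / 2


# (Jaccard mean, F-measure mean, J&F as listed in the table)
rows = {
    "DAVIS 2017 (val)": (66.1, 75.0, 70.55),
    "DAVIS 2016": (76.8, 80.4, 78.6),
    "DAVIS 2017 (test-dev)": (48.0, 58.6, 53.3),
}

for name, (j, f, listed) in rows.items():
    computed = j_and_f(j, f)
    # Allow a small tolerance for rounding in the listed values.
    assert abs(computed - listed) < 0.01, name
    print(f"{name}: J&F = {computed:.2f}")
```

All three listed J&F values agree with the average of the corresponding mean rows.
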