Minji Kim, Seungkwan Lee, Jungseul Ok, Bohyung Han, Minsu Cho
Despite the extensive adoption of machine learning for visual object tracking, recent learning-based approaches have largely overlooked the fact that visual tracking is a sequence-level task by nature; they rely heavily on frame-level training, which inevitably induces inconsistency between training and testing in terms of both data distributions and task objectives. This work introduces a sequence-level training strategy for visual tracking based on reinforcement learning and discusses how a sequence-level design of data sampling, learning objectives, and data augmentation can improve the accuracy and robustness of tracking algorithms. Our experiments on standard benchmarks including LaSOT, TrackingNet, and GOT-10k demonstrate that four representative tracking models, SiamRPN++, SiamAttn, TransT, and TrDiMP, consistently improve when the proposed methods are incorporated into training, without any modification to their architectures.
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | NT-VOT211 | AUC | 37.22 | SLT-TransT |
| Video | NT-VOT211 | Precision | 51.7 | SLT-TransT |
| Object Tracking | LaSOT | AUC | 66.8 | SLT-TransT |
| Object Tracking | LaSOT | Normalized Precision | 75.5 | SLT-TransT |
| Object Tracking | GOT-10k | Average Overlap | 67.5 | SLT-TransT |
| Object Tracking | GOT-10k | Success Rate 0.5 | 76.8 | SLT-TransT |
| Object Tracking | GOT-10k | Success Rate 0.75 | 60.3 | SLT-TransT |
| Object Tracking | TrackingNet | Accuracy | 82.8 | SLT-TransT |
| Object Tracking | TrackingNet | Normalized Precision | 87.5 | SLT-TransT |
| Object Tracking | TrackingNet | Precision | 81.4 | SLT-TransT |
| Object Tracking | NT-VOT211 | AUC | 37.22 | SLT-TransT |
| Object Tracking | NT-VOT211 | Precision | 51.7 | SLT-TransT |
| Visual Object Tracking | LaSOT | AUC | 66.8 | SLT-TransT |
| Visual Object Tracking | LaSOT | Normalized Precision | 75.5 | SLT-TransT |
| Visual Object Tracking | GOT-10k | Average Overlap | 67.5 | SLT-TransT |
| Visual Object Tracking | GOT-10k | Success Rate 0.5 | 76.8 | SLT-TransT |
| Visual Object Tracking | GOT-10k | Success Rate 0.75 | 60.3 | SLT-TransT |
| Visual Object Tracking | TrackingNet | Accuracy | 82.8 | SLT-TransT |
| Visual Object Tracking | TrackingNet | Normalized Precision | 87.5 | SLT-TransT |
| Visual Object Tracking | TrackingNet | Precision | 81.4 | SLT-TransT |
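
The overlap-based metrics in the table (Average Overlap, Success Rate 0.5 and 0.75) are all derived from the per-frame intersection-over-union (IoU) between predicted and ground-truth boxes. The following is a minimal sketch of that computation for a single sequence, not the benchmarks' official evaluation code. The function names, the `(x, y, width, height)` box layout, and the per-sequence averaging are assumptions made for illustration; the official toolkits aggregate over many sequences and report percentages.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x, y, width, height).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping region (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def frame_overlaps(preds: Sequence[Box], gts: Sequence[Box]) -> List[float]:
    """Per-frame IoU over one sequence."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return [iou(p, g) for p, g in zip(preds, gts)]


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over all frames of the sequence (a fraction in [0, 1])."""
    overlaps = frame_overlaps(preds, gts)
    return sum(overlaps) / len(overlaps)


def success_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose IoU exceeds the given threshold."""
    overlaps = frame_overlaps(preds, gts)
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


if __name__ == "__main__":
    # Toy two-frame sequence: a perfect match, then a half-shifted box.
    preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    gts = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0)]
    print(average_overlap(preds, gts))
    print(success_rate(preds, gts, 0.5))
    print(success_rate(preds, gts, 0.75))
```

Because these scores depend on every frame of a sequence rather than on any single frame, they are the kind of sequence-level quantity that the abstract contrasts with frame-level training objectives.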