Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang
The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.
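The core idea of combining SAM 2's own mask confidence with a temporal motion cue can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the paper's exact formulation: a constant-velocity box prediction stands in for the motion model, and a simple weighted sum (with a hypothetical weight `alpha`) fuses the decoder's affinity score with the candidate's IoU against the predicted box.

```python
# Hedged sketch of motion-aware mask selection, in the spirit of SAMURAI.
# The constant-velocity predictor and the alpha-weighted score fusion are
# illustrative assumptions for exposition, not the paper's formulation.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def predict_box(prev_box, velocity):
    """Constant-velocity box prediction (a stand-in for a Kalman-style step)."""
    return tuple(c + v for c, v in zip(prev_box, velocity))

def select_mask(candidates, prev_box, velocity, alpha=0.25):
    """Pick the candidate mask maximizing affinity plus motion agreement.

    candidates: list of (affinity_score, bbox) pairs from the mask decoder.
    alpha: weight of the motion term (hypothetical value).
    """
    pred = predict_box(prev_box, velocity)

    def score(cand):
        affinity, box = cand
        # Fuse decoder confidence with agreement to the predicted trajectory.
        return (1 - alpha) * affinity + alpha * iou(box, pred)

    return max(candidates, key=score)
```

Under this scoring, a slightly lower-confidence mask that matches the predicted trajectory can outrank a higher-confidence mask on a distractor, which is the behavior the abstract attributes to motion-aware selection in crowded scenes.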
All values are percentages.

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Visual Object Tracking | LaSOT | AUC | 74.2 | SAMURAI-L |
| Visual Object Tracking | LaSOT | Normalized Precision | 82.7 | SAMURAI-L |
| Visual Object Tracking | LaSOT | Precision | 80.2 | SAMURAI-L |
| Visual Object Tracking | LaSOT-ext | AUC | 61.0 | SAMURAI-L |
| Visual Object Tracking | LaSOT-ext | Normalized Precision | 73.9 | SAMURAI-L |
| Visual Object Tracking | LaSOT-ext | Precision | 72.2 | SAMURAI-L |
| Visual Object Tracking | GOT-10k | Average Overlap | 81.7 | SAMURAI-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.5 | 92.2 | SAMURAI-L |
| Visual Object Tracking | GOT-10k | Success Rate @ 0.75 | 76.9 | SAMURAI-L |
| Visual Object Tracking | TrackingNet | Accuracy | 85.3 | SAMURAI-L |
| Visual Object Tracking | NeedForSpeed | AUC | 69.2 | SAMURAI-L |
| Visual Object Tracking | OTB-2015 | AUC | 71.5 | SAMURAI-L |
| Visual Object Tracking | DiDi | Tracking quality | 68.0 | SAMURAI |