Zongxin Yang, Yunchao Wei, Yi Yang
This paper investigates how to realize better and more efficient embedding learning for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and therefore have to match and segment each target separately in multi-object scenarios, consuming several times the computing resources. To solve this problem, we propose an Associating Objects with Transformers (AOT) approach that matches and decodes multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets in the same high-dimensional embedding space. We can thus process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To model multi-object association sufficiently, we design a Long Short-Term Transformer that constructs hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than $3\times$ faster in multi-object scenarios. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
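
The identification idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_id_bank`, `assign_ids`, `embed_masks`), the random stand-in for a learned identity bank, and the zero vector used for background pixels are all assumptions made for illustration. The point it shows is that a label map containing any number of objects becomes one embedding map of fixed size, so later matching and decoding would run once rather than once per object.

```python
import random


def build_id_bank(num_ids, dim, seed=0):
    """Hypothetical stand-in for a learned bank of identity vectors.

    A real model would learn these vectors; here they are random.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_ids)]


def assign_ids(object_labels, num_ids, seed=0):
    """Give every object label a distinct, randomly chosen row of the bank."""
    rng = random.Random(seed)
    rows = rng.sample(range(num_ids), len(object_labels))
    return dict(zip(object_labels, rows))


def embed_masks(label_map, id_bank, assignment):
    """Map an H x W multi-object label map to an H x W grid of vectors.

    label_map: grid of ints, 0 = background, k > 0 = k-th object.
    Background pixels get a zero vector (a simplifying assumption).
    The output size depends on H, W and the vector dimension only,
    not on the number of objects.
    """
    zero = [0.0] * len(id_bank[0])
    return [
        [id_bank[assignment[k]] if k in assignment else zero for k in row]
        for row in label_map
    ]


# Toy frame: two objects (labels 1 and 2) on a 2 x 3 grid.
label_map = [[0, 1, 1],
             [2, 2, 0]]
bank = build_id_bank(num_ids=10, dim=4)
assignment = assign_ids([1, 2], num_ids=10, seed=1)
embedding = embed_masks(label_map, bank, assignment)
```

Adding a third object to `label_map` would change which vectors appear in `embedding` but not its shape, which is the property the abstract relies on when it says multiple objects are processed as efficiently as a single one.
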
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | YouTube-VOS 2019 | F-Measure (Seen) | 88.1 | AOT |
| Video | YouTube-VOS 2019 | F-Measure (Unseen) | 86.3 | AOT |
| Video | YouTube-VOS 2019 | Jaccard (Seen) | 83.5 | AOT |
| Video | YouTube-VOS 2019 | Jaccard (Unseen) | 78.4 | AOT |
| Video | DAVIS 2017 (test-dev) | F-measure | 83.3 | AOT |
| Video | DAVIS 2017 (test-dev) | Jaccard | 75.9 | AOT |
| Video | DAVIS 2017 (test-dev) | Mean Jaccard & F-Measure | 79.6 | AOT |
| Video | MOSE | F | 61.3 | AOT |
| Video | MOSE | J | 53.1 | AOT |
| Video | MOSE | J&F | 57.2 | AOT |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 88.4 | SwinB-AOT-L |
| Video | DAVIS 2017 (val) | J&F | 85.4 | SwinB-AOT-L |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 82.4 | SwinB-AOT-L |
| Video | DAVIS 2017 (val) | Params(M) | 65.4 | SwinB-AOT-L |
| Video | DAVIS 2017 (val) | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 87.5 | R50-AOT-L |
| Video | DAVIS 2017 (val) | J&F | 84.9 | R50-AOT-L |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 82.3 | R50-AOT-L |
| Video | DAVIS 2017 (val) | Params(M) | 14.9 | R50-AOT-L |
| Video | DAVIS 2017 (val) | Speed (FPS) | 18 | R50-AOT-L |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 86.4 | AOT-L |
| Video | DAVIS 2017 (val) | J&F | 83.8 | AOT-L |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 81.1 | AOT-L |
| Video | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-L |
| Video | DAVIS 2017 (val) | Speed (FPS) | 18.7 | AOT-L |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 85.2 | AOT-B |
| Video | DAVIS 2017 (val) | J&F | 82.5 | AOT-B |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 79.7 | AOT-B |
| Video | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-B |
| Video | DAVIS 2017 (val) | Speed (FPS) | 29.6 | AOT-B |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 83.9 | AOT-S |
| Video | DAVIS 2017 (val) | J&F | 81.3 | AOT-S |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 78.7 | AOT-S |
| Video | DAVIS 2017 (val) | Params(M) | 7 | AOT-S |
| Video | DAVIS 2017 (val) | Speed (FPS) | 40 | AOT-S |
| Video | DAVIS 2017 (val) | F-measure (Mean) | 82.3 | AOT-T |
| Video | DAVIS 2017 (val) | J&F | 79.9 | AOT-T |
| Video | DAVIS 2017 (val) | Jaccard (Mean) | 77.4 | AOT-T |
| Video | DAVIS 2017 (val) | Params(M) | 5.7 | AOT-T |
| Video | DAVIS 2017 (val) | Speed (FPS) | 51.4 | AOT-T |
| Video | VOT2020 | EAO | 0.586 | SwinB-AOT-L |
| Video | VOT2020 | EAO (real-time) | 0.523 | SwinB-AOT-L |
| Video | VOT2020 | EAO | 0.574 | AOT-L |
| Video | VOT2020 | EAO (real-time) | 0.56 | AOT-L |
| Video | VOT2020 | EAO | 0.569 | R50-AOT-L |
| Video | VOT2020 | EAO (real-time) | 0.54 | R50-AOT-L |
| Video | VOT2020 | EAO | 0.541 | AOT-B |
| Video | VOT2020 | EAO (real-time) | 0.533 | AOT-B |
| Video | VOT2020 | EAO | 0.512 | AOT-S |
| Video | VOT2020 | EAO (real-time) | 0.499 | AOT-S |
| Video | VOT2020 | EAO | 0.435 | AOT-T |
| Video | VOT2020 | EAO (real-time) | 0.433 | AOT-T |
| Video | DAVIS 2016 | F-measure (Mean) | 93.3 | SwinB-AOT-L |
| Video | DAVIS 2016 | J&F | 92 | SwinB-AOT-L |
| Video | DAVIS 2016 | Jaccard (Mean) | 90.7 | SwinB-AOT-L |
| Video | DAVIS 2016 | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Video | DAVIS 2016 | F-measure (Mean) | 92.1 | R50-AOT-L |
| Video | DAVIS 2016 | J&F | 91.1 | R50-AOT-L |
| Video | DAVIS 2016 | Jaccard (Mean) | 90.1 | R50-AOT-L |
| Video | DAVIS 2016 | Speed (FPS) | 18 | R50-AOT-L |
| Video | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-L |
| Video | DAVIS 2016 | J&F | 90.4 | AOT-L |
| Video | DAVIS 2016 | Jaccard (Mean) | 89.6 | AOT-L |
| Video | DAVIS 2016 | Speed (FPS) | 18.7 | AOT-L |
| Video | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-B |
| Video | DAVIS 2016 | J&F | 89.9 | AOT-B |
| Video | DAVIS 2016 | Jaccard (Mean) | 88.7 | AOT-B |
| Video | DAVIS 2016 | Speed (FPS) | 29.6 | AOT-B |
| Video | DAVIS 2016 | F-measure (Mean) | 90.2 | AOT-S |
| Video | DAVIS 2016 | J&F | 89.4 | AOT-S |
| Video | DAVIS 2016 | Jaccard (Mean) | 88.6 | AOT-S |
| Video | DAVIS 2016 | Speed (FPS) | 40 | AOT-S |
| Video | DAVIS 2016 | F-measure (Mean) | 87.4 | AOT-T |
| Video | DAVIS 2016 | J&F | 86.8 | AOT-T |
| Video | DAVIS 2016 | Jaccard (Mean) | 86.1 | AOT-T |
| Video | DAVIS 2016 | Speed (FPS) | 51.4 | AOT-T |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 85.1 | SwinB-AOT-L |
| Video | DAVIS 2017 (test-dev) | FPS | 12.1 | SwinB-AOT-L |
| Video | DAVIS 2017 (test-dev) | J&F | 81.2 | SwinB-AOT-L |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 77.3 | SwinB-AOT-L |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 83.3 | R50-AOT-L |
| Video | DAVIS 2017 (test-dev) | FPS | 18 | R50-AOT-L |
| Video | DAVIS 2017 (test-dev) | J&F | 79.6 | R50-AOT-L |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 75.9 | R50-AOT-L |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 82.3 | AOT-L |
| Video | DAVIS 2017 (test-dev) | FPS | 18.7 | AOT-L |
| Video | DAVIS 2017 (test-dev) | J&F | 78.3 | AOT-L |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.3 | AOT-L |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 79.3 | AOT-B |
| Video | DAVIS 2017 (test-dev) | FPS | 29.6 | AOT-B |
| Video | DAVIS 2017 (test-dev) | J&F | 75.5 | AOT-B |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.6 | AOT-B |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 77.5 | AOT-S |
| Video | DAVIS 2017 (test-dev) | FPS | 40 | AOT-S |
| Video | DAVIS 2017 (test-dev) | J&F | 73.9 | AOT-S |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 70.3 | AOT-S |
| Video | DAVIS 2017 (test-dev) | F-measure (Mean) | 75.7 | AOT-T |
| Video | DAVIS 2017 (test-dev) | FPS | 51.4 | AOT-T |
| Video | DAVIS 2017 (test-dev) | J&F | 72 | AOT-T |
| Video | DAVIS 2017 (test-dev) | Jaccard (Mean) | 68.3 | AOT-T |
| Video | DAVIS (no YouTube-VOS training) | D17 val (F) | 82 | AOT-S |
| Video | DAVIS (no YouTube-VOS training) | D17 val (G) | 79.2 | AOT-S |
| Video | DAVIS (no YouTube-VOS training) | D17 val (J) | 76.4 | AOT-S |
| Video | DAVIS (no YouTube-VOS training) | FPS | 40 | AOT-S |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 89.5 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 88.2 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 84.5 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 79.6 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Overall | 85.5 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 6.4 | R50-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 90.1 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86.9 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 85.1 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Overall | 85.1 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 5.2 | SwinB-AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 89.3 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86.4 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 84.3 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 77.9 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | Overall | 84.5 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | Speed (FPS) | 9.3 | SwinB-AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 88.8 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 87.1 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Overall | 84.5 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 6.5 | AOT-L (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | R50-AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86.1 | R50-AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | R50-AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 78.1 | R50-AOT-L |
| Video | YouTube-VOS 2018 | Overall | 84.1 | R50-AOT-L |
| Video | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L |
| Video | YouTube-VOS 2018 | Speed (FPS) | 14.9 | R50-AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 83.6 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 78 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | Overall | 84.1 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 87.9 | AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 82.9 | AOT-L |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-L |
| Video | YouTube-VOS 2018 | Overall | 83.8 | AOT-L |
| Video | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L |
| Video | YouTube-VOS 2018 | Speed (FPS) | 16 | AOT-L |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 87.5 | AOT-B |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 86 | AOT-B |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 82.6 | AOT-B |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-B |
| Video | YouTube-VOS 2018 | Overall | 83.5 | AOT-B |
| Video | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B |
| Video | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 87 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 85.7 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 82.2 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 77.3 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | Overall | 83 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 86.7 | AOT-S |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 85 | AOT-S |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 82 | AOT-S |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 76.6 | AOT-S |
| Video | YouTube-VOS 2018 | Overall | 82.6 | AOT-S |
| Video | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S |
| Video | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 84.7 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 83.5 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 80 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 75.2 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | Overall | 80.9 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T (all frames) |
| Video | YouTube-VOS 2018 | F-Measure (Seen) | 84.5 | AOT-T |
| Video | YouTube-VOS 2018 | F-Measure (Unseen) | 82.2 | AOT-T |
| Video | YouTube-VOS 2018 | Jaccard (Seen) | 80.1 | AOT-T |
| Video | YouTube-VOS 2018 | Jaccard (Unseen) | 74 | AOT-T |
| Video | YouTube-VOS 2018 | Overall | 80.2 | AOT-T |
| Video | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T |
| Video | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T |
| Object Tracking | VOT2022 | EAO | 0.673 | MS_AOT |
| Video Object Segmentation | YouTube-VOS 2019 | F-Measure (Seen) | 88.1 | AOT |
| Video Object Segmentation | YouTube-VOS 2019 | F-Measure (Unseen) | 86.3 | AOT |
| Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Seen) | 83.5 | AOT |
| Video Object Segmentation | YouTube-VOS 2019 | Jaccard (Unseen) | 78.4 | AOT |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure | 83.3 | AOT |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard | 75.9 | AOT |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Mean Jaccard & F-Measure | 79.6 | AOT |
| Video Object Segmentation | MOSE | F | 61.3 | AOT |
| Video Object Segmentation | MOSE | J | 53.1 | AOT |
| Video Object Segmentation | MOSE | J&F | 57.2 | AOT |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 88.4 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 85.4 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 82.4 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 65.4 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 87.5 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 84.9 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 82.3 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 14.9 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 18 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 86.4 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 83.8 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 81.1 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 18.7 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 85.2 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 82.5 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 79.7 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 29.6 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 83.9 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 81.3 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 78.7 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 7 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 40 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 82.3 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (val) | J&F | 79.9 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 77.4 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 5.7 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 51.4 | AOT-T |
| Video Object Segmentation | VOT2020 | EAO | 0.586 | SwinB-AOT-L |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.523 | SwinB-AOT-L |
| Video Object Segmentation | VOT2020 | EAO | 0.574 | AOT-L |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.56 | AOT-L |
| Video Object Segmentation | VOT2020 | EAO | 0.569 | R50-AOT-L |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.54 | R50-AOT-L |
| Video Object Segmentation | VOT2020 | EAO | 0.541 | AOT-B |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.533 | AOT-B |
| Video Object Segmentation | VOT2020 | EAO | 0.512 | AOT-S |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.499 | AOT-S |
| Video Object Segmentation | VOT2020 | EAO | 0.435 | AOT-T |
| Video Object Segmentation | VOT2020 | EAO (real-time) | 0.433 | AOT-T |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 93.3 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2016 | J&F | 92 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 90.7 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 92.1 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2016 | J&F | 91.1 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 90.1 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 18 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-L |
| Video Object Segmentation | DAVIS 2016 | J&F | 90.4 | AOT-L |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 89.6 | AOT-L |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 18.7 | AOT-L |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-B |
| Video Object Segmentation | DAVIS 2016 | J&F | 89.9 | AOT-B |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.7 | AOT-B |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 29.6 | AOT-B |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 90.2 | AOT-S |
| Video Object Segmentation | DAVIS 2016 | J&F | 89.4 | AOT-S |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.6 | AOT-S |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 40 | AOT-S |
| Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 87.4 | AOT-T |
| Video Object Segmentation | DAVIS 2016 | J&F | 86.8 | AOT-T |
| Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 86.1 | AOT-T |
| Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 51.4 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 85.1 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 12.1 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 81.2 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 77.3 | SwinB-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 83.3 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 18 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 79.6 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 75.9 | R50-AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 82.3 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 18.7 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 78.3 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.3 | AOT-L |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 79.3 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 29.6 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 75.5 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.6 | AOT-B |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 77.5 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 40 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 73.9 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 70.3 | AOT-S |
| Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 75.7 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 51.4 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 72 | AOT-T |
| Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 68.3 | AOT-T |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 82 | AOT-S |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 79.2 | AOT-S |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 76.4 | AOT-S |
| Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 40 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 89.5 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 88.2 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 84.5 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 79.6 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 85.5 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 6.4 | R50-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 90.1 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.9 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 85.1 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 85.1 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 5.2 | SwinB-AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 89.3 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.4 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 84.3 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.9 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.5 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 9.3 | SwinB-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.8 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 87.1 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.5 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 6.5 | AOT-L (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.1 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.1 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.1 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 14.9 | R50-AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.6 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.1 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87.9 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.9 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 83.8 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 16 | AOT-L |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87.5 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.6 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 83.5 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 85.7 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.2 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.3 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 83 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 86.7 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 85 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 76.6 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 82.6 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 84.7 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 83.5 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 80 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 75.2 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 80.9 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T (all frames) |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 84.5 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.2 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 80.1 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 74 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | Overall | 80.2 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T |
| Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T |
| Semi-Supervised Video Object Segmentation | MOSE | F | 61.3 | AOT |
| Semi-Supervised Video Object Segmentation | MOSE | J | 53.1 | AOT |
| Semi-Supervised Video Object Segmentation | MOSE | J&F | 57.2 | AOT |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 88.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 85.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 82.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 65.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 87.5 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 84.9 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 82.3 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 14.9 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 18 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 86.4 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 83.8 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 81.1 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 18.7 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 85.2 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 82.5 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 79.7 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 8.3 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 29.6 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 83.9 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 81.3 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 78.7 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 7 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 40 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | F-measure (Mean) | 82.3 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | J&F | 79.9 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Jaccard (Mean) | 77.4 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Params(M) | 5.7 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (val) | Speed (FPS) | 51.4 | AOT-T |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.586 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.523 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.574 | AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.56 | AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.569 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.54 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.541 | AOT-B |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.533 | AOT-B |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.512 | AOT-S |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.499 | AOT-S |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO | 0.435 | AOT-T |
| Semi-Supervised Video Object Segmentation | VOT2020 | EAO (real-time) | 0.433 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 93.3 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 92 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 90.7 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 12.1 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 92.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 91.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 90.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 18 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 90.4 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 89.6 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 18.7 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 91.1 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 89.9 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.7 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 29.6 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 90.2 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 89.4 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 88.6 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 40 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | F-measure (Mean) | 87.4 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | J&F | 86.8 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Jaccard (Mean) | 86.1 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2016 | Speed (FPS) | 51.4 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 85.1 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 12.1 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 81.2 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 77.3 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 83.3 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 18 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 79.6 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 75.9 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 82.3 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 18.7 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 78.3 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 74.3 | AOT-L |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 79.3 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 29.6 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 75.5 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 71.6 | AOT-B |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 77.5 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 40 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 73.9 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 70.3 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | F-measure (Mean) | 75.7 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | FPS | 51.4 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | J&F | 72 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS 2017 (test-dev) | Jaccard (Mean) | 68.3 | AOT-T |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (F) | 82 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (G) | 79.2 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | D17 val (J) | 76.4 | AOT-S |
| Semi-Supervised Video Object Segmentation | DAVIS (no YouTube-VOS training) | FPS | 40 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 89.5 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 88.2 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 84.5 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 79.6 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 85.5 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 6.4 | R50-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 90.1 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.9 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 85.1 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 85.1 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 5.2 | SwinB-AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 89.3 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 84.3 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.9 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.5 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 65.4 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 9.3 | SwinB-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.8 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 87.1 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.4 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.5 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 6.5 | AOT-L (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.7 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.1 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 14.9 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 14.9 | R50-AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 88.5 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 83.6 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 78 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 84.1 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87.9 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86.5 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.9 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 83.8 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 16 | AOT-L |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87.5 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 86 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.6 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.7 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 83.5 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 8.3 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 20.5 | AOT-B |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 87 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 85.7 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82.2 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 77.3 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 83 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 86.7 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 85 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 82 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 76.6 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 82.6 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 7.9 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 27.1 | AOT-S |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 84.7 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 83.5 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 80 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 75.2 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 80.9 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T (all frames) |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Seen) | 84.5 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | F-Measure (Unseen) | 82.2 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Seen) | 80.1 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Jaccard (Unseen) | 74 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Overall | 80.2 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Params(M) | 5.3 | AOT-T |
| Semi-Supervised Video Object Segmentation | YouTube-VOS 2018 | Speed (FPS) | 41 | AOT-T |
| Visual Object Tracking | VOT2022 | EAO | 0.673 | MS_AOT |