Referring Expression Segmentation on Refer-YouTube-VOS (2021 public validation)
Metric: F (higher is better)
Results
| # | Model | F | SOTA | Extra Data | Paper | Date | Code |
|---|-------|---|------|------------|-------|------|------|
| 1 | MPG-SAM 2 | 76.1 | SOTA | No | MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation | 2025-01-23 | Code |
| 2 | VRS-HQ (Chat-UniVi-13B) | 73.1 | SOTA | No | The Devil is in Temporal Token: High Quality Video Reasoning Segmentation | 2025-01-15 | Code |
| 3 | GLEE-Pro | 72.9 | SOTA | Yes | General Object Foundation Model for Images and Videos at Scale | 2023-12-14 | Code |
| 4 | UNINEXT-H | 72.7 | SOTA | No | Universal Instance Perception as Object Discovery and Retrieval | 2023-03-12 | Code |
| 5 | ReferDINO (Swin-B) | 71.5 | | No | ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations | 2025-01-24 | - |
| 6 | MUTR | 70.4 | | No | Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation | 2023-05-25 | Code |
| 7 | VLP (VLMo-L) | 69.8 | | No | Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation | 2024-05-17 | - |
| 8 | SOC (Joint training, Video-Swin-B) | 69.3 | | No | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | 2023-05-26 | Code |
| 9 | UniRef-L (Swin-L) | 69.2 | | No | - | - | - |
| 10 | DsHmp (Video-Swin-Base) | 69.1 | | No | Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation | 2024-04-04 | Code |
| 11 | UniRef++-L | 69 | | No | UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces | 2023-12-25 | Code |
| 12 | HTR (Pre-training) | 68.9 | | No | Temporally Consistent Referring Video Object Segmentation with Hybrid Memory | 2024-03-28 | Code |
| 13 | ViLLa | 68.6 | | No | ViLLa: Video Reasoning Segmentation with Large Language Model | 2024-07-18 | Code |
| 14 | SgMg (Pre-training) | 67.4 | | No | Spectrum-guided Multi-granularity Referring Video Object Segmentation | 2023-07-25 | Code |
| 15 | EPCFormer (ViT-H) | 67.2 | | No | Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation | 2023-08-08 | - |
| 16 | UniLSeg-100 | 67 | | No | Universal Segmentation at Arbitrary Granularity with Language Instruction | 2023-12-04 | Code |
| 17 | GroPrompt | 66.9 | | No | GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation | 2024-06-18 | - |
| 18 | LoSh-R | 66 | | Yes | LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation | 2023-06-14 | Code |
| 19 | VLT | 65.6 | SOTA | No | VLT: Vision-Language Transformer and Query Generation for Referring Segmentation | 2022-10-28 | Code |
| 20 | OnlineRefer (Swin-L, online) | 65.5 | | No | OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation | 2023-07-18 | Code |
| 21 | R2VOS (Video-Swin-T) | 63.1 | SOTA | Yes | Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus | 2022-07-04 | Code |
| 22 | SOC (Video-Swin-T) | 60.5 | | No | SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation | 2023-05-26 | Code |
| 23 | UniVS(Swin-L) | 59.5 | | Yes | UniVS: Unified and Universal Video Segmentation with Prompts as Queries | 2024-02-28 | Code |
| 24 | ReferFormer (ResNet-101) | 58.4 | SOTA | Yes | Language as Queries for Referring Video Object Segmentation | 2022-01-03 | Code |
| 25 | MTTR (w=12) | 56.64 | SOTA | No | End-to-End Referring Video Object Segmentation with Multimodal Transformers | 2021-11-29 | Code |
| 26 | ReferFormer (ResNet-50) | 56.6 | | Yes | Language as Queries for Referring Video Object Segmentation | 2022-01-03 | Code |
| 27 | MANET | 56.51 | | No | Multi-Attention Network for Compressed Video Referring Object Segmentation | 2022-07-26 | Code |
| 28 | Locater | 51.1 | | No | Local-Global Context Aware Transformer for Language-Guided Video Segmentation | 2022-03-18 | Code |
| 29 | URVOS | 50.8 | | No | - | - | Code |
| 30 | VLIDE | 50.67 | | No | Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation | 2022-03-30 | - |
| 31 | MLRLSA | 48.43 | | No | - | - | - |