ReVOS

Videos | CC BY-NC-SA 4.0 | Introduced 2024-07-16

We create a benchmark dataset named ReVOS. It comprises 35,074 instruction-mask sequence pairs derived from 1,042 diverse videos. In contrast to traditional referring video segmentation datasets such as Ref-YouTube-VOS and MeViS, which primarily contain short, explicit phrases, ReVOS includes text instructions that require a sophisticated understanding of both video content and general world knowledge.