Joint Inductive and Transductive Learning for Video Object Segmentation

Yunyao Mao, Ning Wang, Wengang Zhou, Houqiang Li

2021-08-08 | ICCV 2021
Tasks: Semi-Supervised Video Object Segmentation, Semantic Segmentation, Video Object Segmentation, Video Semantic Segmentation

Abstract

Semi-supervised video object segmentation is the task of segmenting a target object throughout a video sequence given only a mask annotation in the first frame. The limited information available makes the task extremely challenging. Most of the best-performing previous methods adopt either matching-based transductive reasoning or online inductive learning. However, the former is less discriminative for similar instances, and the latter makes insufficient use of spatio-temporal information. In this work, we propose to integrate transductive and inductive learning into a unified framework that exploits their complementarity for accurate and robust video object segmentation. The proposed approach consists of two functional branches. The transduction branch adopts a lightweight transformer architecture to aggregate rich spatio-temporal cues, while the induction branch performs online inductive learning to obtain discriminative target information. To bridge these two diverse branches, a two-head label encoder is introduced to learn a suitable target prior for each of them. The generated mask encodings are further forced to be disentangled to better retain their complementarity. Extensive experiments on several prevalent benchmarks show that, without the need for synthetic training data, the proposed approach sets a series of new state-of-the-art records. Code is available at https://github.com/maoyunyao/JOINT.
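
To make the idea of matching-based transductive reasoning concrete, the following is a minimal sketch of attention-style label propagation: each query-frame location attends to locations in previously segmented (memory) frames and takes a softmax-weighted average of their mask labels. It is not the authors' implementation. The function name `propagate_labels`, the list-of-vectors feature layout, and the scaled dot-product similarity are all assumptions chosen for illustration.

```python
import math


def propagate_labels(query_feats, memory_feats, memory_labels):
    """Hypothetical helper: propagate mask labels from memory to query locations.

    query_feats:   list of feature vectors, one per query-frame location
    memory_feats:  list of feature vectors, one per memory-frame location
    memory_labels: list of floats in [0, 1], the mask value at each memory location
    Returns one soft mask value per query location.
    """
    propagated = []
    for q in query_feats:
        # Scaled dot-product similarity between this query location and every
        # memory location (assumed similarity; other choices are possible).
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, m)) / scale for m in memory_feats]

        # Numerically stable softmax over memory locations.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)

        # The propagated label is the attention-weighted average of memory labels.
        propagated.append(
            sum(w / total * label for w, label in zip(weights, memory_labels))
        )
    return propagated


# Toy example: two memory locations, one on the target and one on the background.
memory_feats = [[10.0, 0.0], [0.0, 10.0]]
memory_labels = [1.0, 0.0]
query_feats = [[10.0, 0.0], [0.0, 10.0]]
soft_mask = propagate_labels(query_feats, memory_feats, memory_labels)
```

In this toy case the first query location closely matches the target memory location and receives a label near 1, while the second receives a label near 0. When two instances have similar features, the weights spread across both, which is the weakness on similar instances that the abstract attributes to purely matching-based methods.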

Results

Task	Dataset	Metric	Value	Model
Video	DAVIS 2017 (val)	F-measure (Mean)	81.2	JOINT
Video	DAVIS 2017 (val)	J&F	78.6	JOINT
Video	DAVIS 2017 (val)	Jaccard (Mean)	76	JOINT
Video	DAVIS (no YouTube-VOS training)	D17 val (F)	81.2	JOINT
Video	DAVIS (no YouTube-VOS training)	D17 val (G)	78.6	JOINT
Video	DAVIS (no YouTube-VOS training)	D17 val (J)	76	JOINT
Video	DAVIS (no YouTube-VOS training)	FPS	4	JOINT
Video Object Segmentation	DAVIS 2017 (val)	F-measure (Mean)	81.2	JOINT
Video Object Segmentation	DAVIS 2017 (val)	J&F	78.6	JOINT
Video Object Segmentation	DAVIS 2017 (val)	Jaccard (Mean)	76	JOINT
Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (F)	81.2	JOINT
Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (G)	78.6	JOINT
Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (J)	76	JOINT
Video Object Segmentation	DAVIS (no YouTube-VOS training)	FPS	4	JOINT
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	F-measure (Mean)	81.2	JOINT
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	J&F	78.6	JOINT
Semi-Supervised Video Object Segmentation	DAVIS 2017 (val)	Jaccard (Mean)	76	JOINT
Semi-Supervised Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (F)	81.2	JOINT
Semi-Supervised Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (G)	78.6	JOINT
Semi-Supervised Video Object Segmentation	DAVIS (no YouTube-VOS training)	D17 val (J)	76	JOINT
Semi-Supervised Video Object Segmentation	DAVIS (no YouTube-VOS training)	FPS	4	JOINT
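
The J&F (also listed as G) value in the table is the average of the mean Jaccard index J and the mean boundary F-measure. The short sketch below computes the Jaccard index for a pair of toy binary masks and checks that the reported J and F values reproduce the reported J&F. The helper names `jaccard` and `j_and_f` are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are given as equally sized nested lists of 0/1 values.
    """
    pairs = [(p, g) for row_p, row_g in zip(pred, gt) for p, g in zip(row_p, row_g)]
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_mean, f_mean):
    """Overall score: arithmetic mean of mean J and mean F."""
    return (j_mean + f_mean) / 2.0


# Toy masks: 2 pixels overlap, 4 pixels are covered by at least one mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
iou = jaccard(pred, gt)  # 2 / 4 = 0.5

# Values from the table above: J = 76, F = 81.2.
overall = j_and_f(76.0, 81.2)  # approximately 78.6, matching the reported J&F
```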
