Actor and Action Modular Network for Text-based Video Segmentation

Jianhua Yang, Yan Huang, Kai Niu, Linjiang Huang, Zhanyu Ma, Liang Wang

2020-11-02

Action Segmentation, Referring Expression Segmentation, Segmentation, Semantic Segmentation, Video Segmentation, Video Semantic Segmentation, Action Understanding

Abstract

Text-based video segmentation aims to segment an actor in video sequences, where a textual query specifies the actor and the action it performs. Previous methods fail to explicitly align the video content with the textual query in a fine-grained manner according to the actor and its action, due to the problem of semantic asymmetry. Semantic asymmetry means that the two modalities contain different amounts of semantic information during the multi-modal fusion process. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action individually in two separate modules. Specifically, we first learn the actor-related and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube, which contains the desired actor and action, is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the predictions temporally consistent. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
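The matching step described in the abstract can be illustrated with a small sketch. It assumes that each candidate tube and the query have already been encoded into separate actor and action vectors, and that the two similarities are simply summed. The cosine similarity, the additive score, and the names `cosine` and `select_tube` are illustrative assumptions, not the scoring functions used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tube(tubes, query_actor, query_action):
    """Pick the candidate tube that best matches the query.

    tubes: list of (actor_feature, action_feature) pairs, one per candidate tube.
    query_actor / query_action: actor-related and action-related query vectors.
    Assumption: the actor and action scores are combined by a plain sum.
    Returns (index of the best tube, list of all scores).
    """
    if not tubes:
        raise ValueError("at least one candidate tube is required")
    scores = [
        cosine(actor_feat, query_actor) + cosine(action_feat, query_action)
        for actor_feat, action_feat in tubes
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: only the second tube matches both the actor and the action.
tubes = [([1, 0], [0, 1]), ([1, 0], [1, 0]), ([0, 1], [1, 0])]
best, scores = select_tube(tubes, query_actor=[1, 0], query_action=[1, 0])
print(best, scores)
```

Because the actor and the action are scored in separate terms, a tube that shows the right actor performing the wrong action scores lower than one that matches both, which is the fine-grained alignment the abstract argues for.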

Results

Task	Dataset	Metric	Value	Model
Instance Segmentation	A2D Sentences	AP	0.396	AAMN
Instance Segmentation	A2D Sentences	IoU mean	0.552	AAMN
Instance Segmentation	A2D Sentences	IoU overall	0.617	AAMN
Instance Segmentation	A2D Sentences	Precision@0.5	0.681	AAMN
Instance Segmentation	A2D Sentences	Precision@0.6	0.629	AAMN
Instance Segmentation	A2D Sentences	Precision@0.7	0.523	AAMN
Instance Segmentation	A2D Sentences	Precision@0.8	0.296	AAMN
Instance Segmentation	A2D Sentences	Precision@0.9	0.029	AAMN
Instance Segmentation	J-HMDB	AP	0.321	AAMN
Instance Segmentation	J-HMDB	IoU mean	0.576	AAMN
Instance Segmentation	J-HMDB	IoU overall	0.583	AAMN
Instance Segmentation	J-HMDB	Precision@0.5	0.773	AAMN
Instance Segmentation	J-HMDB	Precision@0.6	0.627	AAMN
Instance Segmentation	J-HMDB	Precision@0.7	0.36	AAMN
Instance Segmentation	J-HMDB	Precision@0.8	0.044	AAMN
Referring Expression Segmentation	A2D Sentences	AP	0.396	AAMN
Referring Expression Segmentation	A2D Sentences	IoU mean	0.552	AAMN
Referring Expression Segmentation	A2D Sentences	IoU overall	0.617	AAMN
Referring Expression Segmentation	A2D Sentences	Precision@0.5	0.681	AAMN
Referring Expression Segmentation	A2D Sentences	Precision@0.6	0.629	AAMN
Referring Expression Segmentation	A2D Sentences	Precision@0.7	0.523	AAMN
Referring Expression Segmentation	A2D Sentences	Precision@0.8	0.296	AAMN
Referring Expression Segmentation	A2D Sentences	Precision@0.9	0.029	AAMN
Referring Expression Segmentation	J-HMDB	AP	0.321	AAMN
Referring Expression Segmentation	J-HMDB	IoU mean	0.576	AAMN
Referring Expression Segmentation	J-HMDB	IoU overall	0.583	AAMN
Referring Expression Segmentation	J-HMDB	Precision@0.5	0.773	AAMN
Referring Expression Segmentation	J-HMDB	Precision@0.6	0.627	AAMN
Referring Expression Segmentation	J-HMDB	Precision@0.7	0.36	AAMN
Referring Expression Segmentation	J-HMDB	Precision@0.8	0.044	AAMN
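For readers unfamiliar with the metrics in the table, the sketch below shows how they are commonly defined for these benchmarks: overall IoU divides the total intersection by the total union over all test samples, mean IoU averages the per-sample IoUs, and Precision@K is the fraction of samples whose IoU exceeds K. Masks are assumed to be flat lists of 0/1 values, and the function names are hypothetical; this is not the evaluation code used for the reported numbers, and AP is omitted.

```python
def mask_intersection_union(pred, gt):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute overall IoU, mean IoU and Precision@K over paired masks."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need a non-empty, equally sized list of masks")
    total_inter, total_union, ious = 0, 0, []
    for pred, gt in zip(preds, gts):
        inter, union = mask_intersection_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: a sample with an empty union counts as IoU 0.
        ious.append(inter / union if union else 0.0)
    n = len(ious)
    results = {
        "IoU overall": total_inter / total_union if total_union else 0.0,
        "IoU mean": sum(ious) / n,
    }
    for t in thresholds:
        results[f"Precision@{t}"] = sum(1 for iou in ious if iou > t) / n
    return results


# Toy example: one perfect prediction and one that covers a quarter of the target.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 1, 0, 0], [1, 1, 1, 1]]
print(evaluate(preds, gts))
```

Overall IoU weights samples by their pixel area, so large objects dominate it, while mean IoU treats every sample equally. This is why the two values can differ noticeably on the same dataset, as in the A2D Sentences rows above.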
