Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, Cees G. M. Snoek
This paper strives for pixel-level segmentation of actors and their actions in video content. Unlike existing works, which all learn to segment from a fixed vocabulary of actor and action pairs, we infer the segmentation from a natural language input sentence. This allows us to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are outside of the actor and action vocabulary. We propose a fully-convolutional model for pixel-level actor and action segmentation using an encoder-decoder architecture optimized for video. To show the potential of actor and action video segmentation from a sentence, we extend two popular actor and action datasets with more than 7,500 natural language descriptions. Experiments demonstrate the quality of the sentence-guided segmentations, the generalization ability of our model, and its advantage over the state-of-the-art for traditional actor and action segmentation.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Instance Segmentation | A2D Sentences | AP | 0.215 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | IoU mean | 0.426 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | IoU overall | 0.551 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.5 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.376 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.231 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.094 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.004 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | A2D Sentences | AP | 0.198 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | IoU mean | 0.421 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | IoU overall | 0.536 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | Precision@0.5 | 0.475 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | Precision@0.6 | 0.347 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | Precision@0.7 | 0.211 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | Precision@0.8 | 0.08 | Gavrilyuk et al. |
| Instance Segmentation | A2D Sentences | Precision@0.9 | 0.002 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | AP | 0.267 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | IoU mean | 0.57 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | IoU overall | 0.555 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.712 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.518 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.264 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.03 | Gavrilyuk et al. (Optical flow) |
| Instance Segmentation | J-HMDB | AP | 0.233 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | IoU mean | 0.542 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | IoU overall | 0.541 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | Precision@0.5 | 0.699 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | Precision@0.6 | 0.46 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | Precision@0.7 | 0.173 | Gavrilyuk et al. |
| Instance Segmentation | J-HMDB | Precision@0.8 | 0.014 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | AP | 0.215 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.426 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.551 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.5 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.376 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.231 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.094 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.004 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | A2D Sentences | AP | 0.198 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | IoU mean | 0.421 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | IoU overall | 0.536 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | Precision@0.5 | 0.475 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | Precision@0.6 | 0.347 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | Precision@0.7 | 0.211 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | Precision@0.8 | 0.08 | Gavrilyuk et al. |
| Referring Expression Segmentation | A2D Sentences | Precision@0.9 | 0.002 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | AP | 0.267 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.57 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.555 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.712 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.518 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.264 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.03 | Gavrilyuk et al. (Optical flow) |
| Referring Expression Segmentation | J-HMDB | AP | 0.233 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | IoU mean | 0.542 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | IoU overall | 0.541 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | Precision@0.5 | 0.699 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | Precision@0.6 | 0.46 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | Precision@0.7 | 0.173 | Gavrilyuk et al. |
| Referring Expression Segmentation | J-HMDB | Precision@0.8 | 0.014 | Gavrilyuk et al. |
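
The IoU and Precision@K columns above can be read with the following minimal sketch in mind. It assumes the usual definitions for this kind of benchmark: "IoU overall" as total intersection divided by total union across all test samples, "IoU mean" as the average of per-sample IoU, and Precision@K as the fraction of samples whose IoU exceeds K. The function names, the nested-list mask format, and the strict `>` comparison are illustrative assumptions, not the authors' evaluation code; AP is not covered here.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical mask format: a binary mask given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def evaluate(
    preds: List[Mask],
    gts: List[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute overall IoU, mean IoU, and Precision@K over paired masks."""
    inters, unions, ious = [], [], []
    for pred, gt in zip(preds, gts):
        i, u = intersection_union(pred, gt)
        inters.append(i)
        unions.append(u)
        # Assumption: an empty union (both masks empty) scores 0.
        ious.append(i / u if u else 0.0)

    n = len(ious)
    total_union = sum(unions)
    result = {
        # Pooled over all samples, so large objects weigh more.
        "iou_overall": sum(inters) / total_union if total_union else 0.0,
        # Averaged per sample, so every sample weighs the same.
        "iou_mean": sum(ious) / n if n else 0.0,
    }
    for t in thresholds:
        # Assumption: a sample counts as correct when its IoU is strictly above t.
        result[f"precision@{t}"] = sum(iou > t for iou in ious) / n if n else 0.0
    return result


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    print(evaluate(preds, gts))
```

Because "IoU overall" pools pixels across samples while "IoU mean" averages per-sample scores, the two can differ noticeably when object sizes vary, which is consistent with the gap between the two columns in the table.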