Ming Xu, Stephen Gould
We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
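The decoding idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a simplified objective in which only the frame marginal is enforced (each frame carries unit mass), so the KL projection in each mirror descent step reduces to a row normalisation. The function name `decode_with_temporal_prior` and the parameters `radius`, `alpha`, `step` and `n_iters` are hypothetical. The quadratic term penalises temporally close frames that are assigned to different classes, and no action ordering is supplied.

```python
import numpy as np


def decode_with_temporal_prior(cost, radius=2, alpha=0.5, step=1.0, n_iters=50):
    """Hypothetical helper: smooth a noisy (frames x classes) cost matrix.

    Minimises (1 - alpha) * <cost, T> + (alpha / 2) * <Cx @ T @ Cy, T>
    over row-stochastic T by mirror descent with a KL projection.
    This is a simplified sketch, not the formulation from the paper.
    """
    n_frames, n_classes = cost.shape
    idx = np.arange(n_frames)
    gap = np.abs(idx[:, None] - idx[None, :])
    # Cx[i, k] > 0 when frames i and k are distinct and at most `radius` apart.
    cx = ((gap > 0) & (gap <= radius)).astype(float) / (2 * radius)
    # Cy[j, l] = 1 when classes j and l differ.
    cy = 1.0 - np.eye(n_classes)

    # Start from the uniform assignment; every row sums to 1.
    plan = np.full((n_frames, n_classes), 1.0 / n_classes)
    for _ in range(n_iters):
        # Gradient of the objective (valid because Cx and Cy are symmetric).
        grad = (1.0 - alpha) * cost + alpha * (cx @ plan @ cy)
        # Mirror (exponentiated-gradient) step; subtracting the row minimum
        # only rescales each row and keeps exp() well conditioned.
        plan = plan * np.exp(-step * (grad - grad.min(axis=1, keepdims=True)))
        # KL projection onto "each row sums to 1" is a row normalisation.
        plan /= plan.sum(axis=1, keepdims=True)
    return plan


# Toy video (assumed data): 3 actions of 10 frames each, with 3 corrupted frames.
labels = np.repeat(np.arange(3), 10)
cost = 1.0 - np.eye(3)[labels]            # 0 for the true class, 1 otherwise
for frame, wrong in [(4, 1), (15, 0), (26, 1)]:
    cost[frame] = 1.0
    cost[frame, labels[frame]] = 0.6      # the true class looks expensive
    cost[frame, wrong] = 0.4              # a wrong class looks cheapest

plan = decode_with_temporal_prior(cost)
print("errors from raw cost:", int((cost.argmin(axis=1) != labels).sum()))
print("errors after decoding:", int((plan.argmax(axis=1) != labels).sum()))
```

Because the quadratic term is non-convex in general, the iteration should be read as finding a local solution. Every update is a dense matrix product followed by an elementwise operation, which is the kind of computation that maps well onto a GPU.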
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Localization | IKEA ASM | Accuracy | 34 | ASOT |
| Action Localization | IKEA ASM | F1 | 27.9 | ASOT |
| Action Localization | IKEA ASM | JSD | 88.7 | ASOT |
| Action Localization | IKEA ASM | Precision | 21.1 | ASOT |
| Action Localization | IKEA ASM | Recall | 24 | ASOT |
| Action Localization | Youtube INRIA Instructional | Accuracy | 52.9 | ASOT |
| Action Localization | Youtube INRIA Instructional | F1 | 35.1 | ASOT |
| Action Localization | Youtube INRIA Instructional | Precision | 47.6 | ASOT |
| Action Localization | Youtube INRIA Instructional | Recall | 27.8 | ASOT |
| Action Localization | Youtube INRIA Instructional | mIoU | 24.7 | ASOT |
| Action Localization | Breakfast | Accuracy | 56.1 | ASOT |
| Action Localization | Breakfast | F1 | 38.3 | ASOT |
| Action Localization | Breakfast | JSD | 94.9 | ASOT |
| Action Localization | Breakfast | Precision | 36.7 | ASOT |
| Action Localization | Breakfast | Recall | 40.1 | ASOT |
| Action Localization | Breakfast | mIoU | 18.6 | ASOT |
| Action Segmentation | IKEA ASM | Accuracy | 34 | ASOT |
| Action Segmentation | IKEA ASM | F1 | 27.9 | ASOT |
| Action Segmentation | IKEA ASM | JSD | 88.7 | ASOT |
| Action Segmentation | IKEA ASM | Precision | 21.1 | ASOT |
| Action Segmentation | IKEA ASM | Recall | 24 | ASOT |
| Action Segmentation | Youtube INRIA Instructional | Accuracy | 52.9 | ASOT |
| Action Segmentation | Youtube INRIA Instructional | F1 | 35.1 | ASOT |
| Action Segmentation | Youtube INRIA Instructional | Precision | 47.6 | ASOT |
| Action Segmentation | Youtube INRIA Instructional | Recall | 27.8 | ASOT |
| Action Segmentation | Youtube INRIA Instructional | mIoU | 24.7 | ASOT |
| Action Segmentation | Breakfast | Accuracy | 56.1 | ASOT |
| Action Segmentation | Breakfast | F1 | 38.3 | ASOT |
| Action Segmentation | Breakfast | JSD | 94.9 | ASOT |
| Action Segmentation | Breakfast | Precision | 36.7 | ASOT |
| Action Segmentation | Breakfast | Recall | 40.1 | ASOT |
| Action Segmentation | Breakfast | mIoU | 18.6 | ASOT |