Yaser Souri, Mohsen Fayyaz, Luca Minciullo, Gianpiero Francesca, Juergen Gall
Action segmentation is the task of predicting the actions for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being $14$ times faster to train and $20$ times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.
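
The idea of two redundant representations can be illustrated with a small sketch. The abstract does not say what the two representations are, so this example assumes one is a per-frame label sequence and the other is a segment-level description (an ordered list of labels with a length for each). The function names and the disagreement measure are hypothetical and purely illustrative; in particular, `frame_disagreement` is a plain mismatch ratio and is not the MuCon loss.

```python
from itertools import groupby


def segments_to_frames(labels, lengths):
    """Expand (label, length) segments into one label per frame."""
    if len(labels) != len(lengths):
        raise ValueError("labels and lengths must have the same size")
    frames = []
    for label, length in zip(labels, lengths):
        frames.extend([label] * length)
    return frames


def frames_to_segments(frames):
    """Collapse per-frame labels into ordered labels and run lengths."""
    labels, lengths = [], []
    for label, run in groupby(frames):
        labels.append(label)
        lengths.append(sum(1 for _ in run))
    return labels, lengths


def frame_disagreement(frames_a, frames_b):
    """Fraction of frames on which two per-frame labelings differ.

    Illustrative stand-in for a consistency measure (assumption);
    it is not differentiable and not the loss from the paper.
    """
    if len(frames_a) != len(frames_b):
        raise ValueError("both labelings must cover the same frames")
    if not frames_a:
        return 0.0
    mismatches = sum(a != b for a, b in zip(frames_a, frames_b))
    return mismatches / len(frames_a)


# Hypothetical outputs of the two branches for a 5-frame video.
frame_branch = ["pour", "pour", "pour", "stir", "stir"]
segment_branch = segments_to_frames(["pour", "stir"], [2, 3])
print(frame_disagreement(frame_branch, segment_branch))
```

Both forms describe the same segmentation, which is why they can be compared: the segment-level form is expanded to frames and checked against the frame-level form.
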
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Localization | Breakfast | Acc | 62.8 | MuCon |
| Action Localization | Breakfast | Average F1 | 62.6 | MuCon |
| Action Localization | Breakfast | Edit | 76.3 | MuCon |
| Action Localization | Breakfast | F1@10% | 73.2 | MuCon |
| Action Localization | Breakfast | F1@25% | 66.1 | MuCon |
| Action Localization | Breakfast | F1@50% | 48.4 | MuCon |
| Action Localization | Breakfast | Acc | 48.5 | MuCon |
| Action Segmentation | Breakfast | Acc | 62.8 | MuCon |
| Action Segmentation | Breakfast | Average F1 | 62.6 | MuCon |
| Action Segmentation | Breakfast | Edit | 76.3 | MuCon |
| Action Segmentation | Breakfast | F1@10% | 73.2 | MuCon |
| Action Segmentation | Breakfast | F1@25% | 66.1 | MuCon |
| Action Segmentation | Breakfast | F1@50% | 48.4 | MuCon |
| Action Segmentation | Breakfast | Acc | 48.5 | MuCon |
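
For readers unfamiliar with the Acc and Edit columns, the following sketch shows how such metrics are commonly computed in action segmentation. It assumes the usual definitions: Acc is the percentage of correctly labeled frames, and Edit is a normalized Levenshtein distance between the predicted and ground-truth segment label sequences. The table does not state the exact evaluation protocol behind its numbers, so the helper names and details here are assumptions.

```python
from itertools import groupby


def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if not gt:
        return 100.0
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def edit_score(pred, gt):
    """Segmental edit score in [0, 100]; segment durations are ignored."""
    pred_segments = [label for label, _ in groupby(pred)]
    gt_segments = [label for label, _ in groupby(gt)]
    longest = max(len(pred_segments), len(gt_segments))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred_segments, gt_segments) / longest)


# Toy example: the ordering is right, one boundary is misplaced.
pred = ["a", "a", "b", "c"]
gt = ["a", "b", "b", "c"]
print(frame_accuracy(pred, gt), edit_score(pred, gt))
```

Under these definitions a prediction can score high on Edit while scoring lower on Acc, because Edit only checks the order of segments and not where their boundaries fall.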