Jiahui Wang, Zhenyou Wang, Shanna Zhuang, Hui Wang
Temporal convolutions have been the paradigm of choice in action segmentation; they enlarge the long-term receptive field by stacking more convolution layers. However, deeper layers lose the local information needed for frame recognition. To solve this problem, this paper proposes a novel encoder-decoder structure called the Cross-Enhancement Transformer. The approach learns temporal structure representations effectively through an interactive self-attention mechanism: the convolutional feature maps of each encoder layer are concatenated with a set of decoder features produced via self-attention. As a result, local and global information are used simultaneously across a sequence of frame-level actions. In addition, a new loss function that penalizes over-segmentation errors is proposed to improve training. Experiments show that the framework achieves state-of-the-art performance on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast.
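The fusion step described in the abstract can be illustrated with a small NumPy sketch. It is a rough illustration under stated assumptions, not the authors' implementation: the attention here is single-head and has no learned projections, the function names `self_attention` and `fuse` are hypothetical, and the feature sizes are arbitrary. It only shows how local (convolutional) features and global (self-attended) features can sit side by side for every frame.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention over frames.

    Assumption: no learned query/key/value projections, for brevity.
    x has shape (T, d): T frames, d channels.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # (T, T) frame-to-frame similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ x, weights


def fuse(encoder_map, decoder_feats):
    """Concatenate local encoder features with globally attended decoder features."""
    attended, _ = self_attention(decoder_feats)
    return np.concatenate([encoder_map, attended], axis=-1)  # (T, 2d)


rng = np.random.default_rng(0)
T, d = 6, 4                                  # hypothetical sizes
encoder_map = rng.normal(size=(T, d))        # stands in for a conv feature map
decoder_feats = rng.normal(size=(T, d))      # stands in for decoder features
fused = fuse(encoder_map, decoder_feats)
print(fused.shape)
```

Each row of `fused` keeps the frame's own convolutional features in the first `d` channels and a weighted mix of all frames in the remaining `d` channels.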
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Localization | 50 Salads | Acc | 86.9 | CETNet |
| Action Localization | 50 Salads | Edit | 81.7 | CETNet |
| Action Localization | 50 Salads | F1@10% | 87.6 | CETNet |
| Action Localization | 50 Salads | F1@25% | 86.5 | CETNet |
| Action Localization | 50 Salads | F1@50% | 80.1 | CETNet |
| Action Localization | GTEA | Acc | 80.3 | CETNet |
| Action Localization | GTEA | Edit | 87.9 | CETNet |
| Action Localization | GTEA | F1@10% | 91.8 | CETNet |
| Action Localization | GTEA | F1@25% | 91.2 | CETNet |
| Action Localization | GTEA | F1@50% | 81.3 | CETNet |
| Action Localization | Breakfast | Acc | 74.9 | CETNet |
| Action Localization | Breakfast | Average F1 | 71.8 | CETNet |
| Action Localization | Breakfast | Edit | 77.8 | CETNet |
| Action Localization | Breakfast | F1@10% | 79.3 | CETNet |
| Action Localization | Breakfast | F1@25% | 74.3 | CETNet |
| Action Localization | Breakfast | F1@50% | 61.9 | CETNet |
| Action Segmentation | 50 Salads | Acc | 86.9 | CETNet |
| Action Segmentation | 50 Salads | Edit | 81.7 | CETNet |
| Action Segmentation | 50 Salads | F1@10% | 87.6 | CETNet |
| Action Segmentation | 50 Salads | F1@25% | 86.5 | CETNet |
| Action Segmentation | 50 Salads | F1@50% | 80.1 | CETNet |
| Action Segmentation | GTEA | Acc | 80.3 | CETNet |
| Action Segmentation | GTEA | Edit | 87.9 | CETNet |
| Action Segmentation | GTEA | F1@10% | 91.8 | CETNet |
| Action Segmentation | GTEA | F1@25% | 91.2 | CETNet |
| Action Segmentation | GTEA | F1@50% | 81.3 | CETNet |
| Action Segmentation | Breakfast | Acc | 74.9 | CETNet |
| Action Segmentation | Breakfast | Average F1 | 71.8 | CETNet |
| Action Segmentation | Breakfast | Edit | 77.8 | CETNet |
| Action Segmentation | Breakfast | F1@10% | 79.3 | CETNet |
| Action Segmentation | Breakfast | F1@25% | 74.3 | CETNet |
| Action Segmentation | Breakfast | F1@50% | 61.9 | CETNet |
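The metrics in the table (Acc, Edit, F1@k) can be sketched in plain Python as follows. This follows the commonly used definitions of frame-wise accuracy, segmental edit score, and segmental F1 at an IoU threshold; it is an illustrative sketch, not the evaluation code behind the reported numbers, and the helper names and toy label sequences are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[str, int, int]  # (label, start frame, end frame exclusive)


def to_segments(frames: Sequence[str]) -> List[Segment]:
    """Collapse per-frame labels into runs of identical labels."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))
            start = i
    return segments


def frame_accuracy(pred: Sequence[str], gt: Sequence[str]) -> float:
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def edit_score(pred: Sequence[str], gt: Sequence[str]) -> float:
    """Normalized Levenshtein similarity between the two segment label orders."""
    p = [s[0] for s in to_segments(pred)]
    g = [s[0] for s in to_segments(gt)]
    prev = list(range(len(g) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 100.0 * (1.0 - prev[-1] / max(len(p), len(g)))


def f1_at_k(pred: Sequence[str], gt: Sequence[str], overlap: float) -> float:
    """Segmental F1: a predicted segment is a true positive if its IoU with an
    unmatched ground-truth segment of the same label reaches `overlap`."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    used = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= overlap and not used[best_j]:
            tp += 1
            used[best_j] = True
        else:
            fp += 1
    fn = used.count(False)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: one wrong frame splits the first action into three segments.
gt = ["a"] * 4 + ["b"] * 4
pred = ["a", "a", "b", "a", "b", "b", "b", "b"]
print(frame_accuracy(pred, gt))   # 87.5
print(edit_score(pred, gt))       # 50.0
print(round(f1_at_k(pred, gt, 0.5), 2))  # 66.67
```

In the toy example a single mislabeled frame costs little frame accuracy but halves the edit score, which is why over-segmentation is tracked with Edit and F1@k in addition to Acc.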