Yazan Abu Farha, Juergen Gall
Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics. While traditional approaches follow a two-step pipeline, by generating frame-wise probabilities and then feeding them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
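
The abstract describes the training objective only in words: a classification loss combined with a smoothing loss that penalizes over-segmentation. The sketch below shows one plausible form of such an objective in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The smoothing term here is a truncated mean squared difference between the log-probabilities of adjacent frames. The function names, the threshold `tau`, the weight `lam`, and the normalization are all assumptions that do not appear in the text above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classification_loss(frame_logits, labels):
    """Mean frame-wise cross-entropy between predictions and ground-truth labels."""
    total = 0.0
    for logits, label in zip(frame_logits, labels):
        total -= log_softmax(logits)[label]
    return total / len(labels)


def smoothing_loss(frame_logits, tau=4.0):
    """Truncated mean squared difference of adjacent-frame log-probabilities.

    tau (assumed value) caps each difference so that genuine action
    boundaries are not penalized without limit.
    """
    if len(frame_logits) < 2:
        return 0.0
    log_probs = [log_softmax(logits) for logits in frame_logits]
    num_classes = len(log_probs[0])
    total = 0.0
    for prev, cur in zip(log_probs, log_probs[1:]):
        for p, c in zip(prev, cur):
            delta = min(abs(c - p), tau)
            total += delta * delta
    return total / ((len(log_probs) - 1) * num_classes)


def stage_loss(frame_logits, labels, lam=0.15, tau=4.0):
    """Classification loss plus weighted smoothing loss (lam is an assumed weight)."""
    return classification_loss(frame_logits, labels) + lam * smoothing_loss(frame_logits, tau)


# Two toy predictions over 4 frames and 2 classes (hypothetical data).
smooth = [[5.0, 0.0]] * 4
flicker = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
labels = [0, 0, 0, 0]

print(smoothing_loss(smooth), smoothing_loss(flicker))
print(stage_loss(smooth, labels), stage_loss(flicker, labels))
```

A prediction that stays constant across frames incurs no smoothing penalty, while one that flips class at every frame is penalized at each transition. That is the behavior a term targeting over-segmentation needs. In a multi-stage model, a loss of this kind would typically be applied to the output of every stage and summed.
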
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Localization | 50 Salads | Acc | 80.7 | MS-TCN |
| Action Localization | 50 Salads | Edit | 67.9 | MS-TCN |
| Action Localization | 50 Salads | F1@10% | 76.3 | MS-TCN |
| Action Localization | 50 Salads | F1@25% | 74.0 | MS-TCN |
| Action Localization | 50 Salads | F1@50% | 64.5 | MS-TCN |
| Action Localization | GTEA | Acc | 79.2 | MS-TCN |
| Action Localization | GTEA | Edit | 81.4 | MS-TCN |
| Action Localization | GTEA | F1@10% | 87.5 | MS-TCN |
| Action Localization | GTEA | F1@25% | 85.4 | MS-TCN |
| Action Localization | GTEA | F1@50% | 74.6 | MS-TCN |
| Action Localization | Breakfast | Acc | 65.1 | MS-TCN (IDT) |
| Action Localization | Breakfast | Average F1 | 50.6 | MS-TCN (IDT) |
| Action Localization | Breakfast | Edit | 61.4 | MS-TCN (IDT) |
| Action Localization | Breakfast | F1@10% | 58.2 | MS-TCN (IDT) |
| Action Localization | Breakfast | F1@25% | 52.9 | MS-TCN (IDT) |
| Action Localization | Breakfast | F1@50% | 40.8 | MS-TCN (IDT) |
| Action Localization | Breakfast | Acc | 66.3 | MS-TCN (I3D) |
| Action Localization | Breakfast | Average F1 | 46.2 | MS-TCN (I3D) |
| Action Localization | Breakfast | Edit | 61.7 | MS-TCN (I3D) |
| Action Localization | Breakfast | F1@10% | 52.6 | MS-TCN (I3D) |
| Action Localization | Breakfast | F1@25% | 48.1 | MS-TCN (I3D) |
| Action Localization | Breakfast | F1@50% | 37.9 | MS-TCN (I3D) |
| Action Segmentation | 50 Salads | Acc | 80.7 | MS-TCN |
| Action Segmentation | 50 Salads | Edit | 67.9 | MS-TCN |
| Action Segmentation | 50 Salads | F1@10% | 76.3 | MS-TCN |
| Action Segmentation | 50 Salads | F1@25% | 74.0 | MS-TCN |
| Action Segmentation | 50 Salads | F1@50% | 64.5 | MS-TCN |
| Action Segmentation | GTEA | Acc | 79.2 | MS-TCN |
| Action Segmentation | GTEA | Edit | 81.4 | MS-TCN |
| Action Segmentation | GTEA | F1@10% | 87.5 | MS-TCN |
| Action Segmentation | GTEA | F1@25% | 85.4 | MS-TCN |
| Action Segmentation | GTEA | F1@50% | 74.6 | MS-TCN |
| Action Segmentation | Breakfast | Acc | 65.1 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | Average F1 | 50.6 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | Edit | 61.4 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | F1@10% | 58.2 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | F1@25% | 52.9 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | F1@50% | 40.8 | MS-TCN (IDT) |
| Action Segmentation | Breakfast | Acc | 66.3 | MS-TCN (I3D) |
| Action Segmentation | Breakfast | Average F1 | 46.2 | MS-TCN (I3D) |
| Action Segmentation | Breakfast | Edit | 61.7 | MS-TCN (I3D) |
| Action Segmentation | Breakfast | F1@10% | 52.6 | MS-TCN (I3D) |
| Action Segmentation | Breakfast | F1@25% | 48.1 | MS-TCN (I3D) |
| Action Segmentation | Breakfast | F1@50% | 37.9 | MS-TCN (I3D) |
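
The table reports frame-wise accuracy (Acc), a segmental edit score (Edit), and segmental F1 at overlap thresholds of 10%, 25%, and 50%. The page does not define these metrics, so the sketch below assumes the definitions commonly used in the action segmentation literature: accuracy over frames, a normalized Levenshtein distance over the sequence of segment labels, and F1 over segments matched by intersection-over-union (IoU). All function names are hypothetical. The "Average F1" rows are consistent with the mean of the three F1 values in the same block, for example (58.2 + 52.9 + 40.8) / 3 ≈ 50.6.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def frame_accuracy(pred, truth):
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * correct / len(truth)


def edit_score(pred, truth):
    """100 * (1 - normalized Levenshtein distance) over segment label sequences."""
    p = [s[0] for s in segments(pred)]
    t = [s[0] for s in segments(truth)]
    prev = list(range(len(t) + 1))
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 100.0 * (1.0 - prev[-1] / max(len(p), len(t)))


def f1_at_k(pred, truth, k):
    """Segmental F1: a predicted segment is a true positive if its IoU with an
    unmatched ground-truth segment of the same label is at least k."""
    true_segs = segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in segments(pred):
        best_iou, best_idx = 0.0, -1
        for idx, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= k and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical 10-frame example with two spurious one-frame segments.
truth = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 0, 1, 0, 1, 1, 1, 2, 2]
print(frame_accuracy(pred, truth), edit_score(pred, truth), f1_at_k(pred, truth, 0.5))
```

In this toy example only two frames are wrong, so accuracy stays high, yet the two spurious segments lower both the edit score and F1. This is why segmental metrics are reported alongside accuracy: they expose over-segmentation that frame-wise accuracy largely hides.
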