Fadime Sener, Dipika Singhania, Angela Yao
Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that simple techniques such as max-pooling and attention suffice for both next-action and dense anticipation. Experiments on the Breakfast, 50Salads, and EPIC-Kitchens datasets demonstrate the anticipation capabilities of our model and yield state-of-the-art results. With minimal modifications, our model also extends to video segmentation and action recognition.
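The core idea — summarizing the observed video at several temporal granularities with max-pooling, then fusing the summaries with attention — can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: the span lengths, the dot-product attention, and the `query` vector are all assumptions made for the example.

```python
import numpy as np

def temporal_aggregate(features, span_lengths):
    """Multi-granular temporal aggregation (illustrative sketch).

    features:     (T, D) array of per-frame features from the observed video.
    span_lengths: list of K ints; each gives how many of the most recent
                  frames to max-pool into one summary vector.
    Returns a (K, D) array: one aggregate per temporal granularity.
    """
    T, _ = features.shape
    aggregates = []
    for span in span_lengths:
        span = min(span, T)                            # clamp to available frames
        aggregates.append(features[-span:].max(axis=0))  # max-pool over the span
    return np.stack(aggregates)

def attention_fuse(aggregates, query):
    """Fuse the K multi-scale aggregates with dot-product attention
    against a query vector (e.g. a learned anticipation query)."""
    scores = aggregates @ query              # (K,) relevance of each scale
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the K scales
    return weights @ aggregates              # (D,) fused representation
```

For example, with spans of 4, 8, and 16 frames, the model sees the same observation at three levels of temporal abstraction; the attention weights let the predictor emphasize recent fine-grained context or longer-range context as needed.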
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Anticipation | Assembly101 | Actions Recall@5 | 8.53 | TempAgg |
| Action Anticipation | Assembly101 | Objects Recall@5 | 26.27 | TempAgg |
| Action Anticipation | Assembly101 | Verbs Recall@5 | 59.11 | TempAgg |