Fadime Sener, Dipika Singhania, Angela Yao
Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that simple techniques such as max-pooling and attention suffice for both next-action and dense anticipation. Experiments on the Breakfast, 50Salads, and EPIC-Kitchens datasets demonstrate the anticipation capabilities of our model and yield state-of-the-art results. With minimal modifications, our model also extends to video segmentation and action recognition.
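The core idea — summarizing the observed video at several temporal granularities with max-pooling, then fusing the summaries with attention — can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: the span lengths, the dot-product attention, and the `query` vector are all assumptions made for the example.

```python
import numpy as np

def temporal_aggregate(features, span_lengths):
    """Multi-granular temporal aggregation (illustrative sketch).

    features:     (T, D) array of per-frame features from the observed video.
    span_lengths: list of K ints; each gives how many of the most recent
                  frames to max-pool into one summary vector.
    Returns a (K, D) array: one aggregate per temporal granularity.
    """
    T, _ = features.shape
    aggregates = []
    for span in span_lengths:
        span = min(span, T)                            # clamp to available frames
        aggregates.append(features[-span:].max(axis=0))  # max-pool over the span
    return np.stack(aggregates)

def attention_fuse(aggregates, query):
    """Fuse the K multi-scale aggregates with dot-product attention
    against a query vector (e.g. a learned anticipation query)."""
    scores = aggregates @ query              # (K,) relevance of each scale
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over the K scales
    return weights @ aggregates              # (D,) fused representation
```

For example, with spans of 4, 8, and 16 frames, the model sees the same observation at three levels of temporal abstraction; the attention weights let the predictor emphasize recent fine-grained context or longer-range context as needed.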
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Action Anticipation | Assembly101 | Actions Recall@5 | 8.53 | TempAgg |
| Action Anticipation | Assembly101 | Objects Recall@5 | 26.27 | TempAgg |
| Action Anticipation | Assembly101 | Verbs Recall@5 | 59.11 | TempAgg |