How Much Temporal Long-Term Context is Needed for Action Segmentation?

Emad Bahrami, Gianpiero Francesca, Juergen Gall

2023-08-22 | ICCV 2023 | Action Segmentation, Temporal Action Segmentation, Segmentation

Abstract

Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
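The cost argument in the abstract can be made concrete by counting how many query-key pairs each attention pattern has to score. The sketch below is illustrative only: the abstract does not specify the sparse pattern, so the strided pattern here (each frame attends to frames sharing its index modulo a stride) is an assumption, and all function names are hypothetical.

```python
def full_pairs(n):
    """Query-key pairs scored by dense self-attention over n frames."""
    return n * n


def local_window_pairs(n, window):
    """Pairs scored when each frame attends only to the frames inside
    its own non-overlapping window of `window` consecutive frames."""
    total = 0
    for start in range(0, n, window):
        size = min(window, n - start)  # the last window may be shorter
        total += size * size
    return total


def strided_pairs(n, stride):
    """Pairs scored by an assumed sparse pattern: each frame attends to
    every frame with the same index modulo `stride`, so a single group
    spans the whole video while staying small."""
    total = 0
    for offset in range(min(stride, n)):
        size = len(range(offset, n, stride))
        total += size * size
    return total


if __name__ == "__main__":
    n, w = 10_000, 64  # hypothetical video length and window/stride size
    print("dense  :", full_pairs(n))
    print("local  :", local_window_pairs(n, w))
    print("strided:", strided_pairs(n, w))
```

Under these assumptions the local pattern is cheap but never relates frames from different windows, while the strided pattern stays far below the dense cost and still links the first and last frames of the video.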

Results

Task	Dataset	Metric	Value	Model
Action Localization	50 Salads	Acc	87.7	LTContext
Action Localization	50 Salads	Edit	83.2	LTContext
Action Localization	50 Salads	F1@10%	89.4	LTContext
Action Localization	50 Salads	F1@25%	87.7	LTContext
Action Localization	50 Salads	F1@50%	82	LTContext
Action Localization	Assembly101	Edit	30.4	LTContext
Action Localization	Assembly101	F1@10%	33.9	LTContext
Action Localization	Assembly101	F1@25%	30	LTContext
Action Localization	Assembly101	F1@50%	22.6	LTContext
Action Localization	Assembly101	MoF	41.2	LTContext
Action Localization	Breakfast	Acc	74.2	LTContext
Action Localization	Breakfast	Average F1	70.1	LTContext
Action Localization	Breakfast	Edit	77	LTContext
Action Localization	Breakfast	F1@10%	77.6	LTContext
Action Localization	Breakfast	F1@25%	72.6	LTContext
Action Localization	Breakfast	F1@50%	60.1	LTContext
Action Segmentation	50 Salads	Acc	87.7	LTContext
Action Segmentation	50 Salads	Edit	83.2	LTContext
Action Segmentation	50 Salads	F1@10%	89.4	LTContext
Action Segmentation	50 Salads	F1@25%	87.7	LTContext
Action Segmentation	50 Salads	F1@50%	82	LTContext
Action Segmentation	Assembly101	Edit	30.4	LTContext
Action Segmentation	Assembly101	F1@10%	33.9	LTContext
Action Segmentation	Assembly101	F1@25%	30	LTContext
Action Segmentation	Assembly101	F1@50%	22.6	LTContext
Action Segmentation	Assembly101	MoF	41.2	LTContext
Action Segmentation	Breakfast	Acc	74.2	LTContext
Action Segmentation	Breakfast	Average F1	70.1	LTContext
Action Segmentation	Breakfast	Edit	77	LTContext
Action Segmentation	Breakfast	F1@10%	77.6	LTContext
Action Segmentation	Breakfast	F1@25%	72.6	LTContext
Action Segmentation	Breakfast	F1@50%	60.1	LTContext
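For readers unfamiliar with the metrics in the table, the sketch below shows how frame-wise accuracy (Acc / MoF) and the segmental F1 score at an overlap threshold (F1@10%, F1@25%, F1@50%) are commonly computed in the temporal action segmentation literature. The page does not state the exact evaluation code behind these numbers, so this follows the usual convention as an assumption; function names are hypothetical.

```python
def get_segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) runs,
    with `end` exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def f1_at_k(pred, gt, threshold):
    """Segmental F1 in percent. A predicted segment is a true positive if
    its IoU with a not-yet-matched ground-truth segment of the same label
    is at least `threshold` (e.g. 0.1, 0.25, 0.5)."""
    pred_segs = get_segments(pred)
    gt_segs = get_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(gt_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = [0, 0, 0, 0, 1, 1, 1, 1]
    over_segmented = [0, 1, 0, 0, 1, 1, 1, 1]  # one spurious frame
    print(frame_accuracy(over_segmented, gt))
    print(f1_at_k(over_segmented, gt, 0.1))
```

In the toy example a single mislabeled frame barely changes frame-wise accuracy but splits one predicted segment into three, which lowers the segmental F1. This is why the table reports both kinds of metric.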
