Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Is Space-Time Attention All You Need for Video Understanding?

Gedas Bertasius, Heng Wang, Lorenzo Torresani

Published: 2021-02-09
Tasks: Action Classification, Anomaly Detection, Video Question Answering, Video Classification, General Classification, Video Understanding, Action Recognition
Links: Paper · PDF · Code (official) · community implementations

Abstract

We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
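The "divided attention" scheme described above can be sketched in a few lines: within each block, every patch token first attends across frames at the same spatial location (temporal attention), then attends across patches within its own frame (spatial attention). The sketch below is a simplified single-head NumPy illustration under assumed shapes; the actual TimeSformer block also uses layer normalization, multi-head attention, an MLP, and a classification token (see the official repository for the real implementation). The function and parameter names here are hypothetical.

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    # x: (batch, seq, dim) -- single-head scaled dot-product self-attention
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
    return w @ v

def divided_space_time_block(x, T, N, params):
    # x: (B, T*N, D) frame-level patch tokens; T frames, N patches per frame.
    # params: hypothetical dict with "time" and "space" (Wq, Wk, Wv) triples.
    B, _, D = x.shape
    # 1) Temporal attention: each spatial location attends across the T frames.
    xt = x.reshape(B, T, N, D).transpose(0, 2, 1, 3).reshape(B * N, T, D)
    xt = xt + attention(xt, *params["time"])          # residual connection
    # 2) Spatial attention: each frame's N patches attend to one another.
    xs = xt.reshape(B, N, T, D).transpose(0, 2, 1, 3).reshape(B * T, N, D)
    xs = xs + attention(xs, *params["space"])         # residual connection
    return xs.reshape(B, T, N, D).reshape(B, T * N, D)
```

Applying the two attentions separately keeps the cost per block at O(T²·N + N²·T) comparisons instead of the O((T·N)²) of joint space-time attention, which is what makes longer clips tractable.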

Results

Task                     | Dataset                | Metric         | Value | Model
-------------------------|------------------------|----------------|-------|---------------
Video                    | Kinetics-400           | Acc@1          | 80.7  | TimeSformer-L
Video                    | Kinetics-400           | Acc@5          | 94.7  | TimeSformer-L
Video                    | Kinetics-400           | Parameters (M) | 121.4 | TimeSformer-L
Video                    | Kinetics-400           | Acc@1          | 79.7  | TimeSformer-HR
Video                    | Kinetics-400           | Acc@5          | 94.4  | TimeSformer-HR
Video                    | Kinetics-400           | Acc@1          | 78    | TimeSformer
Video                    | Kinetics-400           | Acc@5          | 93.7  | TimeSformer
Anomaly Detection        | UBnormal               | RBDC           | 0.04  | TimeSformer
Anomaly Detection        | UBnormal               | TBDC           | 0.05  | TimeSformer
Video Question Answering | Howto100M-QA           | Accuracy       | 62.1  | TimeSformer
Activity Recognition     | Diving-48              | Accuracy       | 81    | TimeSformer-L
Activity Recognition     | Diving-48              | Accuracy       | 78    | TimeSformer-HR
Activity Recognition     | Diving-48              | Accuracy       | 75    | TimeSformer
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 62.5  | TimeSformer-HR
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 62.3  | TimeSformer-L
Activity Recognition     | Something-Something V2 | Top-1 Accuracy | 59.5  | TimeSformer
Action Recognition       | Diving-48              | Accuracy       | 81    | TimeSformer-L
Action Recognition       | Diving-48              | Accuracy       | 78    | TimeSformer-HR
Action Recognition       | Diving-48              | Accuracy       | 75    | TimeSformer
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 62.5  | TimeSformer-HR
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 62.3  | TimeSformer-L
Action Recognition       | Something-Something V2 | Top-1 Accuracy | 59.5  | TimeSformer

Related Papers

Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems (2025-07-21)
3DKeyAD: High-Resolution 3D Point Cloud Anomaly Detection via Keypoint-Guided Point Clustering (2025-07-17)
A Semi-Supervised Learning Method for the Identification of Bad Exposures in Large Imaging Surveys (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
A Privacy-Preserving Framework for Advertising Personalization Incorporating Federated Learning and Differential Privacy (2025-07-16)
Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection (2025-07-15)
Modeling Code: Is Text All You Need? (2025-07-15)