Multiview Transformers for Video Recognition

Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

2022-01-12 | CVPR 2022 | Action Classification, Video Understanding, Action Recognition

Abstract

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at: https://github.com/google-research/scenic/tree/main/scenic/projects/mtv.
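The abstract does not say how the "views" are built. As a minimal sketch of what distinct spatiotemporal resolutions could mean, assume each view tokenizes the same clip with non-overlapping tubelets of a different temporal length. The `View` class, the view names, the tubelet sizes, and the 32-frame 224x224 clip below are all illustrative assumptions, not values taken from this page, and the code is not the released implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """A hypothetical view: the tubelet size used to tokenize the video."""
    name: str
    t: int  # tubelet length in frames
    h: int  # tubelet height in pixels
    w: int  # tubelet width in pixels


def num_tokens(frames: int, height: int, width: int, view: View) -> int:
    """Count the non-overlapping tubelets (tokens) that tile the clip."""
    if frames % view.t or height % view.h or width % view.w:
        raise ValueError(f"clip shape is not divisible by the tubelet of view {view.name!r}")
    return (frames // view.t) * (height // view.h) * (width // view.w)


# Assumed views: identical spatial patch size, different temporal extent.
VIEWS = [
    View("fine", 2, 16, 16),
    View("medium", 4, 16, 16),
    View("coarse", 8, 16, 16),
]


def tokens_per_view(frames: int = 32, height: int = 224, width: int = 224) -> dict:
    """Map each assumed view to the number of tokens its encoder would receive."""
    return {v.name: num_tokens(frames, height, width, v) for v in VIEWS}


if __name__ == "__main__":
    for name, count in tokens_per_view().items():
        print(f"{name}: {count} tokens")
```

Under these assumptions the fine view yields 3136 tokens, the medium view 1568, and the coarse view 784, so each encoder would see the same clip at a different temporal granularity. The lateral connections that fuse information across views are not modelled in this sketch.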

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-700	Top-1 Accuracy	83.4	MTV-H (WTS 60M)
Video	Kinetics-700	Top-5 Accuracy	96.2	MTV-H (WTS 60M)
Video	MiT	Top-1 Accuracy	47.2	MTV-H (WTS 60M)
Video	MiT	Top-5 Accuracy	75.7	MTV-H (WTS 60M)
Video	Kinetics-400	Acc@1	89.9	MTV-H (WTS 60M)
Video	Kinetics-400	Acc@5	98.3	MTV-H (WTS 60M)
Video	Kinetics-600	Top-1 Accuracy	90.3	MTV-H (WTS 60M)
Video	Kinetics-600	Top-5 Accuracy	98.5	MTV-H (WTS 60M)
Activity Recognition	EPIC-KITCHENS-100	Action@1	50.5	MTV-B (WTS 60M)
Activity Recognition	EPIC-KITCHENS-100	Noun@1	63.9	MTV-B (WTS 60M)
Activity Recognition	EPIC-KITCHENS-100	Verb@1	69.9	MTV-B (WTS 60M)
Activity Recognition	Something-Something V2	Top-1 Accuracy	68.5	MTV-B
Activity Recognition	Something-Something V2	Top-5 Accuracy	90.4	MTV-B
Action Recognition	EPIC-KITCHENS-100	Action@1	50.5	MTV-B (WTS 60M)
Action Recognition	EPIC-KITCHENS-100	Noun@1	63.9	MTV-B (WTS 60M)
Action Recognition	EPIC-KITCHENS-100	Verb@1	69.9	MTV-B (WTS 60M)
Action Recognition	Something-Something V2	Top-1 Accuracy	68.5	MTV-B
Action Recognition	Something-Something V2	Top-5 Accuracy	90.4	MTV-B
