
Co-training Transformer with Videos and Images Improves Action Recognition

BoWen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, Fei Sha

2021-12-14 · Action Classification · Object Recognition · Video Classification · Action Recognition · Action Recognition in Videos

Abstract

In learning action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on target action recognition with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has been made on how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pretrained on larger-scale image datasets following previous state-of-the-art, CoVeR achieves the best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
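The two findings above amount to a simple recipe: keep one shared spatio-temporal backbone, give each dataset its own classification head over its own label space, and feed image batches through the same model as single-frame videos. The PyTorch sketch below illustrates that recipe under generic assumptions; it is not the authors' released code, and the names (CoTrainingModel, co_training_step, the backbone argument) are purely illustrative.

```python
# Minimal sketch of the co-training idea (illustrative, not the authors' implementation):
# one shared spatio-temporal backbone, one classification head per dataset,
# and images treated as single-frame videos so they share the video pathway.
import torch
import torch.nn as nn

class CoTrainingModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, classes_per_dataset: dict):
        super().__init__()
        self.backbone = backbone  # assumed: maps (B, T, C, H, W) -> (B, feat_dim)
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n) for name, n in classes_per_dataset.items()
        })

    def forward(self, clips: torch.Tensor, dataset: str) -> torch.Tensor:
        if clips.dim() == 4:                  # (B, C, H, W): an image batch
            clips = clips.unsqueeze(1)        # treat images as single-frame videos (B, 1, C, H, W)
        features = self.backbone(clips)       # shared representation across all datasets
        return self.heads[dataset](features)  # dataset-specific label space

def co_training_step(model, batches, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """One joint step. `batches` maps dataset name -> (clips, labels)."""
    optimizer.zero_grad()
    total_loss = sum(loss_fn(model(clips, name), labels)
                     for name, (clips, labels) in batches.items())
    (total_loss / len(batches)).backward()    # average the per-dataset losses
    optimizer.step()
    return total_loss.item()
```

In this sketch, each optimization step samples one batch per dataset (video and image alike) and averages the per-dataset losses, so the shared backbone sees all label spaces jointly while each head remains dataset-specific.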

Results

Task                 | Dataset                | Metric         | Value | Model
Video                | Kinetics-700           | Top-1 Accuracy | 79.8  | CoVeR (JFT-3B)
Video                | Kinetics-700           | Top-5 Accuracy | 94.9  | CoVeR (JFT-3B)
Video                | Kinetics-700           | Top-1 Accuracy | 78.5  | CoVeR (JFT-300M)
Video                | Kinetics-700           | Top-5 Accuracy | 94.2  | CoVeR (JFT-300M)
Video                | Moments in Time (MiT)  | Top-1 Accuracy | 46.1  | CoVeR (JFT-3B)
Video                | Moments in Time (MiT)  | Top-5 Accuracy | 75.4  | CoVeR (JFT-3B)
Video                | Moments in Time (MiT)  | Top-1 Accuracy | 45    | CoVeR (JFT-300M)
Video                | Moments in Time (MiT)  | Top-5 Accuracy | 73.9  | CoVeR (JFT-300M)
Video                | Kinetics-400           | Top-1 Accuracy | 87.2  | CoVeR (JFT-3B)
Video                | Kinetics-400           | Top-5 Accuracy | 97.5  | CoVeR (JFT-3B)
Video                | Kinetics-400           | Top-1 Accuracy | 86.3  | CoVeR (JFT-300M)
Video                | Kinetics-400           | Top-5 Accuracy | 97.2  | CoVeR (JFT-300M)
Video                | Kinetics-600           | Top-1 Accuracy | 87.9  | CoVeR (JFT-3B)
Video                | Kinetics-600           | Top-5 Accuracy | 97.8  | CoVeR (JFT-3B)
Video                | Kinetics-600           | Top-1 Accuracy | 86.8  | CoVeR (JFT-300M)
Video                | Kinetics-600           | Top-5 Accuracy | 97.3  | CoVeR (JFT-300M)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 70.9  | CoVeR (JFT-3B)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 92.5  | CoVeR (JFT-3B)
Activity Recognition | Something-Something V2 | Top-1 Accuracy | 69.8  | CoVeR (JFT-300M)
Activity Recognition | Something-Something V2 | Top-5 Accuracy | 91.9  | CoVeR (JFT-300M)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 70.9  | CoVeR (JFT-3B)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 92.5  | CoVeR (JFT-3B)
Action Recognition   | Something-Something V2 | Top-1 Accuracy | 69.8  | CoVeR (JFT-300M)
Action Recognition   | Something-Something V2 | Top-5 Accuracy | 91.9  | CoVeR (JFT-300M)
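All entries above report Top-1 and Top-5 accuracy. For readers unfamiliar with the metric, the small helper below shows the conventional computation (a hypothetical utility, not any benchmark's official evaluation script): a sample counts as correct at rank k if its ground-truth label is among the k highest-scoring classes.

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring predictions."""
    topk = logits.topk(k, dim=1).indices              # (B, k) highest-scoring class indices
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# Example: Top-1 and Top-5 accuracy over a random batch of logits.
logits = torch.randn(8, 400)                          # e.g. Kinetics-400 has 400 classes
labels = torch.randint(0, 400, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```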

Related Papers

A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing (2025-07-08)
Out-of-distribution detection in 3D applications: a review (2025-07-01)
Zero-shot Skeleton-based Action Recognition with Prototype-guided Feature Alignment (2025-07-01)
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment (2025-06-28)
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception (2025-06-26)
Feature Hallucination for Self-supervised Action Recognition (2025-06-25)
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition (2025-06-25)