Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.
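
The factorization described above can be illustrated with a small parameter-count sketch. It assumes a full 3D filter of size t x d x d is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution, with an intermediate channel count chosen so that the factorized block has roughly as many weights as the full 3D convolution. The function names, the default kernel sizes, and the rule for picking the intermediate width are assumptions made for illustration; none of them are stated in the abstract.

```python
def full_3d_params(n_in, n_out, t=3, d=3):
    """Weights in one t x d x d 3D convolution (biases ignored)."""
    return t * d * d * n_in * n_out


def matched_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count for the factorized block.

    Assumption: pick the largest width whose factorized weight count
    does not exceed that of the full 3D convolution.
    """
    weights_per_mid_channel = d * d * n_in + t * n_out
    return full_3d_params(n_in, n_out, t, d) // weights_per_mid_channel


def factorized_params(n_in, n_out, mid, t=3, d=3):
    """Weights in a spatial 1 x d x d conv followed by a temporal t x 1 x 1 conv."""
    spatial = d * d * n_in * mid   # n_in -> mid channels, 2D filters
    temporal = t * mid * n_out     # mid -> n_out channels, 1D filters
    return spatial + temporal


if __name__ == "__main__":
    n_in, n_out = 64, 64
    mid = matched_mid_channels(n_in, n_out)
    print("intermediate channels:", mid)
    print("full 3D weights:      ", full_3d_params(n_in, n_out))
    print("factorized weights:   ", factorized_params(n_in, n_out, mid))
```

For 64 input and 64 output channels with 3 x 3 x 3 filters, this sketch gives 144 intermediate channels and 110592 weights for both variants. Under this assumption, any accuracy difference between the two would come from the factorized structure, including the extra nonlinearity that can be placed between the two convolutions, rather than from added capacity.
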
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Kinetics-400 | Acc@1 | 75.4 | R[2+1]D-Flow (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@5 | 91.9 | R[2+1]D-Flow (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@1 | 74.3 | R[2+1]D-RGB (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@5 | 91.4 | R[2+1]D-RGB (Sports-1M pretrain) |
| Video | Kinetics-400 | Acc@1 | 73.9 | R[2+1]D-Two-Stream |
| Video | Kinetics-400 | Acc@5 | 90.9 | R[2+1]D-Two-Stream |
| Video | Kinetics-400 | Acc@1 | 72 | R[2+1]D |
| Video | Kinetics-400 | Acc@5 | 90 | R[2+1]D |
| Video | Kinetics-400 | Acc@1 | 72 | R[2+1]D-RGB |
| Video | Kinetics-400 | Acc@5 | 90 | R[2+1]D-RGB |
| Video | Kinetics-400 | Acc@1 | 67.5 | R[2+1]D-Flow |
| Video | Kinetics-400 | Acc@5 | 87.2 | R[2+1]D-Flow |
| Activity Recognition | Sports-1M | Video hit@1 | 73.3 | R[2+1]D-Two-Stream-32frame |
| Activity Recognition | Sports-1M | Video hit@5 | 91.9 | R[2+1]D-Two-Stream-32frame |
| Activity Recognition | Sports-1M | Clip Hit@1 | 57 | R[2+1]D-RGB-32frame |
| Activity Recognition | Sports-1M | Video hit@1 | 73 | R[2+1]D-RGB-32frame |
| Activity Recognition | Sports-1M | Video hit@5 | 91.5 | R[2+1]D-RGB-32frame |
| Activity Recognition | Sports-1M | Clip Hit@1 | 46.4 | R[2+1]D-Flow-32frame |
| Activity Recognition | Sports-1M | Video hit@1 | 68.4 | R[2+1]D-Flow-32frame |
| Activity Recognition | Sports-1M | Video hit@5 | 88.7 | R[2+1]D-Flow-32frame |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 78.7 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 76.4 | R[2+1]D-Flow (Kinetics pretrained) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 74.5 | R[2+1]D-RGB (Kinetics pretrained) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 72.7 | R[2+1]D-TwoStream (Sports1M pretrained) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 70.1 | R[2+1]D-Flow (Sports1M pretrained) |
| Activity Recognition | HMDB-51 | Average accuracy of 3 splits | 66.6 | R[2+1]D-RGB (Sports1M pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 97.3 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 96.8 | R[2+1]D-RGB (Kinetics pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 95.5 | R[2+1]D-Flow (Kinetics pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 95 | R[2+1]D-TwoStream (Sports-1M pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 93.6 | R[2+1]D-RGB (Sports-1M pretrained) |
| Activity Recognition | UCF101 | 3-fold Accuracy | 93.3 | R[2+1]D-Flow (Sports-1M pretrained) |
| Action Recognition | Sports-1M | Video hit@1 | 73.3 | R[2+1]D-Two-Stream-32frame |
| Action Recognition | Sports-1M | Video hit@5 | 91.9 | R[2+1]D-Two-Stream-32frame |
| Action Recognition | Sports-1M | Clip Hit@1 | 57 | R[2+1]D-RGB-32frame |
| Action Recognition | Sports-1M | Video hit@1 | 73 | R[2+1]D-RGB-32frame |
| Action Recognition | Sports-1M | Video hit@5 | 91.5 | R[2+1]D-RGB-32frame |
| Action Recognition | Sports-1M | Clip Hit@1 | 46.4 | R[2+1]D-Flow-32frame |
| Action Recognition | Sports-1M | Video hit@1 | 68.4 | R[2+1]D-Flow-32frame |
| Action Recognition | Sports-1M | Video hit@5 | 88.7 | R[2+1]D-Flow-32frame |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 78.7 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 76.4 | R[2+1]D-Flow (Kinetics pretrained) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 74.5 | R[2+1]D-RGB (Kinetics pretrained) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 72.7 | R[2+1]D-TwoStream (Sports1M pretrained) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 70.1 | R[2+1]D-Flow (Sports1M pretrained) |
| Action Recognition | HMDB-51 | Average accuracy of 3 splits | 66.6 | R[2+1]D-RGB (Sports1M pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 97.3 | R[2+1]D-TwoStream (Kinetics pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 96.8 | R[2+1]D-RGB (Kinetics pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 95.5 | R[2+1]D-Flow (Kinetics pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 95 | R[2+1]D-TwoStream (Sports-1M pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 93.6 | R[2+1]D-RGB (Sports-1M pretrained) |
| Action Recognition | UCF101 | 3-fold Accuracy | 93.3 | R[2+1]D-Flow (Sports-1M pretrained) |