CT-Net: Channel Tensorization Network for Video Classification

Kunchang Li, Xianhang Li, Yali Wang, Jun Wang, Yu Qiao

Published: 2021-06-03 (ICLR 2021)
Tasks: Action Classification, Video Classification, Action Recognition, Classification

Abstract

3D convolution is powerful for video classification but often computationally expensive, so recent studies mainly focus on decomposing it along the spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a good balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), which treats the channel dimension of the input feature as a multiplication of K sub-dimensions. On one hand, this naturally factorizes convolution across multiple dimensions, leading to a light computational burden. On the other hand, it effectively enhances feature interaction across different channels and progressively enlarges the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism, which learns to exploit spatial, temporal and channel attention in a high-dimensional manner to improve the cooperative power of all the feature dimensions in the CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400 and Something-Something V1 and V2. Our CT-Net outperforms a number of recent SOTA approaches in terms of accuracy and/or efficiency. The code and models will be available at https://github.com/Andy1621/CT-Net.
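
The abstract's core idea, writing the channel dimension as C = C_1 × ... × C_K and convolving along one sub-dimension at a time, can be illustrated with a small counting sketch. This is not the authors' implementation. It assumes that the k-th convolution is a grouped 3D convolution with C / C_k groups, each mapping C_k channels to C_k channels, and that every step uses the same kernel size; the function names are hypothetical and chosen only for this example. The second helper tracks which channels can be influenced by a single input channel once each sub-dimension has been convolved.

```python
from math import prod


def dense_conv_weights(channels, kernel=(3, 3, 3)):
    """Weight count of a dense 3D convolution, channels -> channels, no bias."""
    return channels * channels * prod(kernel)


def tensorized_conv_weights(sub_dims, kernel=(3, 3, 3)):
    """Weight count when each sub-dimension gets its own convolution.

    Assumption (not taken from the paper): step k is a grouped convolution
    with C / C_k groups, each mapping C_k channels to C_k channels.
    """
    channels = prod(sub_dims)
    return sum(
        (channels // c_k) * c_k * c_k * prod(kernel)
        for c_k in sub_dims
    )


def reachable_channels(sub_dims, start):
    """Channel indices influenced by `start` after one step per sub-dimension.

    A channel is indexed by a tuple (i_1, ..., i_K). Step k mixes channels
    that differ only in coordinate k.
    """
    reached = {tuple(start)}
    for axis, size in enumerate(sub_dims):
        reached = {
            idx[:axis] + (j,) + idx[axis + 1:]
            for idx in reached
            for j in range(size)
        }
    return reached


if __name__ == "__main__":
    for sub_dims in [(8, 8), (4, 4, 4)]:
        channels = prod(sub_dims)
        dense = dense_conv_weights(channels)
        factored = tensorized_conv_weights(sub_dims)
        touched = len(reachable_channels(sub_dims, (0,) * len(sub_dims)))
        print(sub_dims, dense, factored, factored / dense, touched)
```

Under these assumptions, 64 channels split as 8 × 8 need 27,648 weights instead of 110,592 (a ratio of (C_1 + C_2) / C = 0.25), yet after both steps a single input channel still reaches all 64 output channels. This matches the abstract's pairing of a lighter computation with feature interaction across channels.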

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-400	Acc@1	79.8	CT-Net Ensemble
Activity Recognition	Something-Something V1	Top 1 Accuracy	56.6	CT-Net Ensemble (R50, 8+12+16+24)
Activity Recognition	Something-Something V2	GFLOPs	280	CT-Net Ensemble (R50, 8+12+16+24)
Activity Recognition	Something-Something V2	Parameters	83.8	CT-Net Ensemble (R50, 8+12+16+24)
Activity Recognition	Something-Something V2	Top-1 Accuracy	67.8	CT-Net Ensemble (R50, 8+12+16+24)
Activity Recognition	Something-Something V2	Top-5 Accuracy	91.1	CT-Net Ensemble (R50, 8+12+16+24)
Action Recognition	Something-Something V1	Top 1 Accuracy	56.6	CT-Net Ensemble (R50, 8+12+16+24)
Action Recognition	Something-Something V2	GFLOPs	280	CT-Net Ensemble (R50, 8+12+16+24)
Action Recognition	Something-Something V2	Parameters	83.8	CT-Net Ensemble (R50, 8+12+16+24)
Action Recognition	Something-Something V2	Top-1 Accuracy	67.8	CT-Net Ensemble (R50, 8+12+16+24)
Action Recognition	Something-Something V2	Top-5 Accuracy	91.1	CT-Net Ensemble (R50, 8+12+16+24)
