X3D: Expanding Architectures for Efficient Video Recognition

Christoph Feichtenhofer

2020-04-09 | CVPR 2020
Image Classification, Action Classification, feature selection, Video Recognition, Video Classification, General Classification


Abstract

This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes, in space, time, width and depth. Inspired by feature selection methods in machine learning, a simple stepwise network expansion approach is employed that expands a single axis in each step, such that good accuracy to complexity trade-off is achieved. To expand X3D to a specific target complexity, we perform progressive forward expansion followed by backward contraction. X3D achieves state-of-the-art performance while requiring 4.8x and 5.5x fewer multiply-adds and parameters for similar accuracy as previous work. Our most surprising finding is that networks with high spatiotemporal resolution can perform well, while being extremely light in terms of network width and parameters. We report competitive accuracy at unprecedented efficiency on video classification and detection benchmarks. Code will be available at: https://github.com/facebookresearch/SlowFast
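The stepwise procedure described in the abstract (expand one axis per step, keep the expansion with the best trade-off, then contract back to the target complexity) can be illustrated with a small greedy search. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Config` fields, the `cost` proxy, the expansion and contraction step sizes, and `toy_score` (which stands in for training and validating each candidate network) are all hypothetical.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Config:
    frames: int      # time axis: number of input frames
    resolution: int  # space axis: side length of the input in pixels
    width: int       # width axis: number of channels
    depth: int       # depth axis: number of blocks


def cost(c: Config) -> float:
    """Toy proxy for multiply-adds (an assumption, not the paper's measure)."""
    return c.frames * c.resolution ** 2 * c.width ** 2 * c.depth


def is_valid(c: Config) -> bool:
    return min(c.frames, c.resolution, c.width, c.depth) >= 1


# Hypothetical expansion moves; each roughly doubles the toy cost.
EXPAND: Dict[str, Callable[[Config], Config]] = {
    "time": lambda c: replace(c, frames=c.frames * 2),
    "space": lambda c: replace(c, resolution=round(c.resolution * math.sqrt(2))),
    "width": lambda c: replace(c, width=round(c.width * math.sqrt(2))),
    "depth": lambda c: replace(c, depth=c.depth * 2),
}

# Hypothetical contraction moves: one small step back along a single axis.
CONTRACT: Dict[str, Callable[[Config], Config]] = {
    "time": lambda c: replace(c, frames=c.frames - 1),
    "space": lambda c: replace(c, resolution=c.resolution - 8),
    "width": lambda c: replace(c, width=c.width - 1),
    "depth": lambda c: replace(c, depth=c.depth - 1),
}


def forward_expand(
    base: Config, score: Callable[[Config], float], target: float
) -> Tuple[Config, List[str]]:
    """Expand exactly one axis per step, keeping the best-scoring candidate."""
    current, history = base, []
    while cost(current) < target:
        candidates = [(axis, move(current)) for axis, move in EXPAND.items()]
        axis, current = max(candidates, key=lambda pair: score(pair[1]))
        history.append(axis)
    return current, history


def backward_contract(
    model: Config, score: Callable[[Config], float], target: float
) -> Config:
    """Step back along the least harmful axis until the cost fits the target."""
    current = model
    while cost(current) > target:
        options = [m(current) for m in CONTRACT.values()]
        options = [c for c in options if is_valid(c)]
        if not options:
            break  # nothing left to shrink
        current = max(options, key=score)
    return current


def toy_score(c: Config) -> float:
    """Placeholder for validation accuracy: diminishing returns on every axis."""
    return (
        3 * (1 - math.exp(-c.frames / 8))
        + 3 * (1 - math.exp(-c.resolution / 160))
        + 2 * (1 - math.exp(-c.width / 24))
        + 2 * (1 - math.exp(-c.depth / 12))
    )


if __name__ == "__main__":
    base = Config(frames=1, resolution=112, width=8, depth=4)  # illustrative start
    target = 64 * cost(base)
    expanded, history = forward_expand(base, toy_score, target)
    final = backward_contract(expanded, toy_score, target)
    print("axes expanded, in order:", history)
    print("after contraction:", final)
```

In a real setting, `score` would require training each candidate, which is why expanding a single axis per step keeps the search cheap: only a handful of candidates are evaluated at each step rather than every combination of axis settings.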

Results

Task	Dataset	Metric	Value (%)	Model
Video	Kinetics-400	Acc@1	80.4	X3D-XXL
Video	Kinetics-400	Acc@5	94.6	X3D-XXL
Video	Kinetics-400	Acc@1	79.1	X3D-XL
Video	Kinetics-400	Acc@5	93.9	X3D-XL
Video	Kinetics-400	Acc@1	77.5	X3D-L
Video	Kinetics-400	Acc@5	92.9	X3D-L
Video	Kinetics-400	Acc@1	76.0	X3D-M
Video	Kinetics-400	Acc@5	92.3	X3D-M
