Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pinpointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks: Kinetics, Charades, and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast
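To make the two-pathway idea concrete, here is a minimal sketch of the frame sampling and channel bookkeeping. The symbols are assumptions taken from the full SlowFast paper rather than this abstract: the Slow pathway sees T frames with temporal stride τ, the Fast pathway runs α times faster with a fraction β of the Slow channels, and the lateral connection is assumed to be the time-strided convolution variant (kernel 5, stride α, 2βC output channels) fused by concatenation. The class and method names are hypothetical; this is shape arithmetic only, not the released implementation.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Tuple


@dataclass(frozen=True)
class SlowFastSampling:
    """Hypothetical helper; defaults assume the 4x16 setting with alpha=8, beta=1/8."""
    T: int = 4                       # frames seen by the Slow pathway
    tau: int = 16                    # Slow temporal stride, in raw frames
    alpha: int = 8                   # Fast/Slow frame-rate ratio (assumed)
    beta: Fraction = Fraction(1, 8)  # Fast/Slow channel ratio (assumed)

    def __post_init__(self) -> None:
        # The Fast stride tau / alpha must be a whole number of raw frames.
        if self.tau % self.alpha != 0:
            raise ValueError("tau must be divisible by alpha")

    def slow_indices(self, start: int = 0) -> List[int]:
        # T frames, one every tau raw frames.
        return [start + i * self.tau for i in range(self.T)]

    def fast_indices(self, start: int = 0) -> List[int]:
        # alpha * T frames over the same span, one every tau / alpha raw frames.
        stride = self.tau // self.alpha
        return [start + i * stride for i in range(self.alpha * self.T)]

    def fast_channels(self, slow_channels: int) -> int:
        # The Fast pathway keeps only a fraction beta of the Slow channel width.
        return int(slow_channels * self.beta)

    def fused_slow_shape(self, slow_channels: int) -> Tuple[int, int]:
        # Assumed lateral connection: time-strided conv (kernel 5, padding 2,
        # stride alpha, 2*beta*C output channels), concatenated onto the Slow
        # features. Returns (channels, frames) of the fused Slow tensor.
        fast_t = self.alpha * self.T
        out_t = (fast_t + 2 * 2 - 5) // self.alpha + 1  # collapses alpha*T back to T
        lateral_c = int(2 * self.beta * slow_channels)
        return slow_channels + lateral_c, out_t


cfg = SlowFastSampling()
print(cfg.slow_indices())        # [0, 16, 32, 48]
print(len(cfg.fast_indices()))   # 32
print(cfg.fast_channels(256))    # 32
print(cfg.fused_slow_shape(64))  # (80, 4)
```

Both pathways cover the same span of raw frames; the Fast pathway simply samples it α times more densely while staying cheap because its width is scaled by β.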
| Task | Dataset | Metric | Value (%) | Model |
|---|---|---|---|---|
| Video | Charades | mAP | 45.2 | SlowFast (Kinetics-600 pretraining, NL) |
| Video | Charades | mAP | 42.5 | SlowFast (Kinetics-400 pretraining, NL) |
| Video | Charades | mAP | 42.1 | SlowFast (Kinetics-600 pretraining) |
| Video | Kinetics-400 | Top-1 Accuracy | 79.8 | SlowFast 16x8 (ResNet-101 + NL) |
| Video | Kinetics-400 | Top-5 Accuracy | 93.9 | SlowFast 16x8 (ResNet-101 + NL) |
| Video | Kinetics-400 | Top-1 Accuracy | 78.9 | SlowFast 16x8 (ResNet-101) |
| Video | Kinetics-400 | Top-5 Accuracy | 93.5 | SlowFast 16x8 (ResNet-101) |
| Video | Kinetics-400 | Top-1 Accuracy | 77.9 | SlowFast 8x8 (ResNet-101) |
| Video | Kinetics-400 | Top-5 Accuracy | 93.2 | SlowFast 8x8 (ResNet-101) |
| Video | Kinetics-400 | Top-1 Accuracy | 77.0 | SlowFast 8x8 (ResNet-50) |
| Video | Kinetics-400 | Top-5 Accuracy | 92.6 | SlowFast 8x8 (ResNet-50) |
| Video | Kinetics-400 | Top-1 Accuracy | 75.6 | SlowFast 4x16 (ResNet-50) |
| Video | Kinetics-400 | Top-5 Accuracy | 92.1 | SlowFast 4x16 (ResNet-50) |
| Video | Kinetics-600 | Top-1 Accuracy | 81.8 | SlowFast 16x8 (ResNet-101 + NL) |
| Video | Kinetics-600 | Top-5 Accuracy | 95.1 | SlowFast 16x8 (ResNet-101 + NL) |
| Video | Kinetics-600 | Top-1 Accuracy | 81.1 | SlowFast 16x8 (ResNet-101) |
| Video | Kinetics-600 | Top-5 Accuracy | 95.1 | SlowFast 16x8 (ResNet-101) |
| Video | Kinetics-600 | Top-1 Accuracy | 80.4 | SlowFast 8x8 (ResNet-101) |
| Video | Kinetics-600 | Top-5 Accuracy | 94.8 | SlowFast 8x8 (ResNet-101) |
| Video | Kinetics-600 | Top-1 Accuracy | 79.9 | SlowFast 8x8 (ResNet-50) |
| Video | Kinetics-600 | Top-5 Accuracy | 94.5 | SlowFast 8x8 (ResNet-50) |
| Video | Kinetics-600 | Top-1 Accuracy | 78.8 | SlowFast 4x16 (ResNet-50) |
| Video | Kinetics-600 | Top-5 Accuracy | 94.0 | SlowFast 4x16 (ResNet-50) |
| Activity Recognition | Diving-48 | Accuracy | 77.6 | SlowFast |
| Activity Recognition | AVA v2.1 | mAP (val) | 28.3 | SlowFast++ (Kinetics-600 pretraining, NL) |
| Activity Recognition | AVA v2.1 | mAP (val) | 27.3 | SlowFast (Kinetics-600 pretraining, NL) |
| Activity Recognition | AVA v2.1 | mAP (val) | 26.8 | SlowFast (Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.1 | mAP (val) | 26.3 | SlowFast (Kinetics-400 pretraining) |
| Activity Recognition | Something-Something V2 | Top-1 Accuracy | 61.7 | SlowFast |
| Activity Recognition | H2O (2 Hands and Objects) | Action Top-1 Accuracy | 77.69 | SlowFast |
| Activity Recognition | AVA v2.2 | mAP | 27.5 | SlowFast 16x8 (ResNet-101 + NL, Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 27.1 | SlowFast 8x8 (ResNet-101 + NL, Kinetics-600 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 23.8 | SlowFast 8x8 (ResNet-101, Kinetics-400 pretraining) |
| Activity Recognition | AVA v2.2 | mAP | 21.9 | SlowFast 4x16 (ResNet-50, Kinetics-400 pretraining) |
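The model names in the table carry a "T x τ" tag (4x16, 8x8, 16x8). In the SlowFast paper this denotes the Slow pathway's input: T frames sampled every τ raw frames, so the clip spans T·τ raw frames; the Fast pathway's ratio α is not encoded in the name. The small helper below, with hypothetical function names, is a sketch for reading that tag out of a table row and computing the implied clip span.

```python
import re
from typing import Tuple


def parse_t_tau(model_name: str) -> Tuple[int, int]:
    """Extract (T, tau) from a name such as 'SlowFast 16x8 (ResNet-101 + NL)'."""
    match = re.search(r"(\d+)x(\d+)", model_name)
    if match is None:
        raise ValueError(f"no T x tau tag in {model_name!r}")
    return int(match.group(1)), int(match.group(2))


def raw_span(model_name: str) -> int:
    # The Slow pathway samples T frames with stride tau: a span of T * tau raw frames.
    T, tau = parse_t_tau(model_name)
    return T * tau


for name in ["SlowFast 4x16 (ResNet-50)",
             "SlowFast 8x8 (ResNet-101)",
             "SlowFast 16x8 (ResNet-101 + NL)"]:
    print(name, parse_t_tau(name), raw_span(name))
# SlowFast 4x16 (ResNet-50) (4, 16) 64
# SlowFast 8x8 (ResNet-101) (8, 8) 64
# SlowFast 16x8 (ResNet-101 + NL) (16, 8) 128
```

Under this reading, the 4x16 and 8x8 models look at the same 64-frame window at different temporal densities, while the 16x8 models see a window twice as long.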