Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at https://github.com/facebookresearch/video-nonlocal-net.
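As a rough illustration of "the response at a position as a weighted sum of the features at all positions", the following is a minimal sketch in plain Python, not the released implementation. The function name `non_local` is hypothetical, and the weighting used here (a softmax over pairwise dot products) is an assumption chosen for illustration; the abstract does not specify the weighting function. The sketch also omits any learned projections or residual connection that a trainable block would typically have.

```python
import math

def non_local(x):
    """Illustrative non-local operation (hypothetical helper).

    x: list of N feature vectors, one per position (lists of floats).
    Returns a list of N response vectors. Each response is a weighted
    sum of the features at ALL positions, not just a local neighborhood.
    """
    out = []
    for xi in x:
        # Affinity between position i and every position j.
        # Assumption: dot-product similarity, normalized with a softmax.
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # non-negative, sums to 1
        # Response at position i: weighted sum over all positions j.
        yi = [sum(w * xj[d] for w, xj in zip(weights, x))
              for d in range(len(xi))]
        out.append(yi)
    return out

# Two positions with 1-D features.
y = non_local([[0.0], [1.0]])
print(y)
```

Because the weights at each position are non-negative and sum to one, every response is a convex combination of the input features, regardless of how far apart the contributing positions are.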
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video | Toyota Smarthome dataset | CS | 53.6 | I3D + Non Local |
| Video | Toyota Smarthome dataset | CV1 | 34.3 | I3D + Non Local |
| Video | Toyota Smarthome dataset | CV2 | 43.9 | I3D + Non Local |
| Video | Kinetics-400 | Acc@1 | 77.7 | I3D + NL |
| Video | Kinetics-400 | Acc@5 | 93.3 | I3D + NL |
| Activity Recognition | Something-Something V1 | Top 1 Accuracy | 44.4 | NL I3D |
| Pose Estimation | COCO (Common Objects in Context) | Validation AP | 66.5 | Mask R-CNN + NL blocks (4 in head, 1 in backbone) |
| Object Detection | COCO minival | AP50 | 67.8 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| Object Detection | COCO minival | AP75 | 48.9 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| Object Detection | COCO minival | box AP | 45 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| Object Detection | COCO minival | AP50 | 63.1 | Mask R-CNN (ResNet-101 + 1 NL) |
| Object Detection | COCO minival | AP75 | 44.5 | Mask R-CNN (ResNet-101 + 1 NL) |
| Object Detection | COCO minival | box AP | 40.8 | Mask R-CNN (ResNet-101 + 1 NL) |
| Object Detection | COCO minival | AP50 | 61.1 | Mask R-CNN (ResNet-50 + 1 NL) |
| Object Detection | COCO minival | AP75 | 41.9 | Mask R-CNN (ResNet-50 + 1 NL) |
| Object Detection | COCO minival | box AP | 39 | Mask R-CNN (ResNet-50 + 1 NL) |
| Instance Segmentation | COCO minival | mask AP | 40.3 | Mask R-CNN (ResNeXt-152, +1 NL) |
| Instance Segmentation | COCO minival | mask AP | 37.1 | Mask R-CNN (ResNet-101, +1 NL) |
| Instance Segmentation | COCO minival | mask AP | 35.5 | Mask R-CNN (ResNet-50, +1 NL) |
| Action Recognition | Something-Something V1 | Top 1 Accuracy | 44.4 | NL I3D |
| 2D Object Detection | COCO minival | AP50 | 67.8 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| 2D Object Detection | COCO minival | AP75 | 48.9 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| 2D Object Detection | COCO minival | box AP | 45 | Mask R-CNN (ResNeXt-152 + 1 NL) |
| 2D Object Detection | COCO minival | AP50 | 63.1 | Mask R-CNN (ResNet-101 + 1 NL) |
| 2D Object Detection | COCO minival | AP75 | 44.5 | Mask R-CNN (ResNet-101 + 1 NL) |
| 2D Object Detection | COCO minival | box AP | 40.8 | Mask R-CNN (ResNet-101 + 1 NL) |
| 2D Object Detection | COCO minival | AP50 | 61.1 | Mask R-CNN (ResNet-50 + 1 NL) |
| 2D Object Detection | COCO minival | AP75 | 41.9 | Mask R-CNN (ResNet-50 + 1 NL) |
| 2D Object Detection | COCO minival | box AP | 39 | Mask R-CNN (ResNet-50 + 1 NL) |