Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how the patches are involved in the classification decision. We combine this learned aggregation layer with a simplistic patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation, and detection.
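To make the aggregation idea concrete, here is a minimal sketch of single-query attention pooling next to plain average pooling. It is not the authors' implementation. It assumes one head, no residual connection or feed-forward part, and plain Python lists in place of tensors. The names `attention_pool`, `average_pool`, `cls_query`, `W_k`, and `W_v` are hypothetical and chosen only for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(patches):
    """Baseline: every patch contributes equally."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[j] for p in patches) / n for j in range(dim)]


def attention_pool(patches, cls_query, W_k, W_v):
    """Assumed simplification: one query vector attends over all patches.

    Returns the pooled vector and the per-patch weights, which indicate
    how much each patch contributes to the aggregated representation.
    """
    d = len(cls_query)
    keys = [matvec(W_k, p) for p in patches]
    values = [matvec(W_v, p) for p in patches]
    scores = [
        sum(q * k for q, k in zip(cls_query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = attention_pool(patches, [5.0, 0.0], identity, identity)
    print("weights:", weights)
    print("attention-pooled:", pooled)
    print("average-pooled:", average_pool(patches))
```

In this sketch, an all-zero query gives every patch the same score, so the weights are uniform and, with an identity value projection, the result coincides with average pooling. A learned query lets the weights depart from uniform, which is the sense in which the layer weights how patches are involved in the decision.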
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K val | mIoU | 52.9 | PatchConvNet-L120 (UperNet) |
| Semantic Segmentation | ADE20K val | mIoU | 52.8 | PatchConvNet-B120 (UperNet) |
| Semantic Segmentation | ADE20K val | mIoU | 51.1 | PatchConvNet-B60 (UperNet) |
| Semantic Segmentation | ADE20K val | mIoU | 49.3 | PatchConvNet-S60 (UperNet) |
| Object Detection | COCO minival | box AP | 47 | PatchConvNet-S120 (Mask R-CNN) |
| Object Detection | COCO minival | box AP | 46.4 | PatchConvNet-S60 (Mask R-CNN) |