Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

Chenglin Yang, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen

2022-10-04 · Image Classification · Semantic Segmentation · Instance Segmentation · Object Detection
Paper · PDF · Code (official) · Code

Abstract

This paper presents MOAT, a family of neural networks built on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% / 81.5% top-1 accuracy on ImageNet-1K / ImageNet-1K-V2 with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large-resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus across windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference, and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. The tiny-MOAT family is also benchmarked on downstream tasks, serving as a baseline for the community. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is publicly available.
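The block design described above — take a standard Transformer block, swap its MLP for a mobile convolution (inverted residual) block, and move that block before the self-attention operation — can be sketched in PyTorch. This is a hypothetical reading of the abstract, not the authors' official implementation; the expansion ratio, normalization placement, and head count are assumptions for illustration.

```python
# Hypothetical sketch of a MOAT block, based on the abstract's description:
# an inverted-residual "mobile convolution" block runs FIRST (local pixel
# mixing), followed by global self-attention. Hyperparameters are assumed.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: pointwise expand -> depthwise 3x3 -> project."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, hidden, kernel_size=1),  # pointwise expansion
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden),               # depthwise convolution
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),  # pointwise projection
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual connection


class MOATBlock(nn.Module):
    """MBConv (in place of the Transformer MLP, reordered first),
    then global multi-head self-attention over all spatial positions."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.mbconv = MBConv(dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x = self.mbconv(x)                      # local mixing first
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        q = self.norm(seq)
        attn_out, _ = self.attn(q, q, q)        # global self-attention
        seq = seq + attn_out                    # residual around attention
        return seq.transpose(1, 2).reshape(b, c, h, w)
```

Because the depthwise convolution already exchanges information between neighboring pixels, replacing the global attention here with per-window attention (for large inputs) would not require a window-shifting mechanism, per the abstract.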

Results

Semantic Segmentation — ADE20K (single-scale inference)

Model | Params (M) | Validation mIoU
MOAT-4 (IN-22K pretraining) | 496 | 57.6
MOAT-3 (IN-22K pretraining) | 198 | 56.5
MOAT-2 (IN-22K pretraining) | 81 | 54.7
tiny-MOAT-3 (IN-1K pretraining) | 24 | 47.5
tiny-MOAT-2 (IN-1K pretraining) | 13 | 44.9
tiny-MOAT-1 (IN-1K pretraining) | 8 | 43.1
tiny-MOAT-0 (IN-1K pretraining) | 6 | 41.2

Object Detection — COCO (single-scale inference)

Model | Dataset | box AP
MOAT-3 (22K+1K) | COCO (Common Objects in Context) | 59.2
MOAT-2 | COCO (Common Objects in Context) | 58.5
MOAT-3 (IN-22K pretraining) | COCO minival | 59.2
MOAT-2 (IN-22K pretraining) | COCO minival | 58.5
MOAT-1 (IN-1K pretraining) | COCO minival | 57.7
MOAT-0 (IN-1K pretraining) | COCO minival | 55.9
tiny-MOAT-3 (IN-1K pretraining) | COCO minival | 55.2
tiny-MOAT-2 (IN-1K pretraining) | COCO minival | 53.0
tiny-MOAT-1 (IN-1K pretraining) | COCO minival | 51.9
tiny-MOAT-0 (IN-1K pretraining) | COCO minival | 50.5

Image Classification — ImageNet V2 (IN-22K pretraining)

Model | Top-1 Accuracy
MOAT-4 | 81.5
MOAT-3 | 80.6
MOAT-2 | 79.3
MOAT-1 | 78.4

Image Classification — ImageNet (compute)

Model | GFLOPs
MOAT-4 (22K+1K) | 648.5
MOAT-3 (1K only) | 271
MOAT-0 (1K only) | 5.7
Instance Segmentation — COCO minival (single-scale inference)

Model | mask AP
MOAT-3 (IN-22K pretraining) | 50.3
MOAT-2 (IN-22K pretraining) | 49.3
MOAT-1 (IN-1K pretraining) | 49.0
MOAT-0 (IN-1K pretraining) | 47.4
tiny-MOAT-3 (IN-1K pretraining) | 47.0
tiny-MOAT-2 (IN-1K pretraining) | 45.0
tiny-MOAT-1 (IN-1K pretraining) | 44.6
tiny-MOAT-0 (IN-1K pretraining) | 43.3

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)