Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, ZhiYuan Chen, Jiangbin Zheng, Stan Z. Li
By contextualizing the kernel as globally as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals a representation bottleneck in modern ConvNets: expressive interactions are not effectively encoded as the kernel size increases. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive parameter efficiency, and competitive performance compared to state-of-the-art ViTs and ConvNets on ImageNet and various downstream vision benchmarks, including COCO object detection, ADE20K semantic segmentation, 2D and 3D human pose estimation, and video prediction. Notably, MogaNet reaches 80.0% and 87.8% accuracy with 5.2M and 181M parameters on ImageNet-1K, outperforming ParC-Net and ConvNeXt-L while saving 59% FLOPs and 17M parameters, respectively. The source code is available at https://github.com/Westlake-AI/MogaNet.
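The abstract names "convolutions and gated aggregation" but does not spell out the module's internals. The following is only a toy 1D illustration of what gating a convolutional context branch can look like, not the MogaNet module itself. Every name here (`dilated_conv1d`, `gated_aggregation`), the dilation rates, the averaging of branches, and the choice of SiLU are assumptions made for the example.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, same-length 1D convolution with a given dilation."""
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k * dilation - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out


def gated_aggregation(x, kernel, dilations=(1, 2, 3), gate_weight=1.0):
    """Hypothetical gated aggregation: a gate branch modulates a context branch."""
    # Context branch (assumption): average of convolutions at several dilations.
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    context = [sum(vals) / len(branches) for vals in zip(*branches)]
    # Gate branch (assumption): pointwise scaling followed by SiLU.
    gate = [silu(gate_weight * v) for v in x]
    # Element-wise product lets the gate decide how much context passes through.
    return [g * silu(c) for g, c in zip(gate, context)]


signal = [0.0, 0.0, 1.0, 0.0, 0.0]
smoothing_kernel = [0.25, 0.5, 0.25]
output = gated_aggregation(signal, smoothing_kernel)
print(output)
```

In this sketch the gate is zero wherever the input is zero, so only the position holding the impulse produces a nonzero response. A real 2D module would use learned depthwise and pointwise convolutions in place of the fixed kernel and scalar weight.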
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | Validation mIoU | 54 | MogaNet-XL (UperNet) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1176 | MogaNet-L (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.9 | MogaNet-L (UperNet) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 1050 | MogaNet-B (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 50.1 | MogaNet-B (UperNet) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 946 | MogaNet-S (UperNet) |
| Semantic Segmentation | ADE20K | Validation mIoU | 49.2 | MogaNet-S (UperNet) |
| Semantic Segmentation | ADE20K | GFLOPs (512 x 512) | 189 | MogaNet-S (Semantic FPN) |
| Semantic Segmentation | ADE20K | Validation mIoU | 47.7 | MogaNet-S (Semantic FPN) |
| Pose Estimation | COCO val2017 | AP | 77.3 | MogaNet-B (384x288) |
| Pose Estimation | COCO val2017 | AP50 | 91.4 | MogaNet-B (384x288) |
| Pose Estimation | COCO val2017 | AP75 | 84 | MogaNet-B (384x288) |
| Pose Estimation | COCO val2017 | AR | 82.2 | MogaNet-B (384x288) |
| Pose Estimation | COCO val2017 | AP | 76.4 | MogaNet-S (384x288) |
| Pose Estimation | COCO val2017 | AP50 | 91 | MogaNet-S (384x288) |
| Pose Estimation | COCO val2017 | AP75 | 83.3 | MogaNet-S (384x288) |
| Pose Estimation | COCO val2017 | AR | 81.4 | MogaNet-S (384x288) |
| Pose Estimation | COCO val2017 | AP | 74.9 | MogaNet-S (256x192) |
| Pose Estimation | COCO val2017 | AR | 80.1 | MogaNet-S (256x192) |
| Pose Estimation | COCO val2017 | AP | 73.2 | MogaNet-T (256x192) |
| Pose Estimation | COCO val2017 | AP50 | 90.1 | MogaNet-T (256x192) |
| Pose Estimation | COCO val2017 | AP75 | 81 | MogaNet-T (256x192) |
| Pose Estimation | COCO val2017 | AR | 78.8 | MogaNet-T (256x192) |
| Object Detection | COCO 2017 val | AP | 56.2 | MogaNet-XL (Cascade Mask R-CNN) |
| Object Detection | COCO 2017 val | AP | 53.3 | MogaNet-L (Cascade Mask R-CNN) |
| Object Detection | COCO 2017 val | AP | 52.6 | MogaNet-B (Cascade Mask R-CNN) |
| Object Detection | COCO 2017 val | AP | 51.6 | MogaNet-S (Cascade Mask R-CNN) |
| Object Detection | COCO 2017 val | AP | 49.4 | MogaNet-L (Mask R-CNN 1x) |
| Object Detection | COCO 2017 val | AP | 48.7 | MogaNet-L (RetinaNet 1x) |
| Object Detection | COCO 2017 val | AP | 47.9 | MogaNet-B (Mask R-CNN 1x) |
| Object Detection | COCO 2017 val | AP | 47.7 | MogaNet-B (RetinaNet 1x) |
| Object Detection | COCO 2017 val | AP | 46.7 | MogaNet-S (Mask R-CNN 1x) |
| Object Detection | COCO 2017 val | AP | 45.8 | MogaNet-S (RetinaNet 1x) |
| Object Detection | COCO 2017 val | AP | 42.6 | MogaNet-T (Mask R-CNN 1x) |
| Object Detection | COCO 2017 val | AP | 41.4 | MogaNet-T (RetinaNet 1x) |
| Object Detection | COCO 2017 val | AP | 40.7 | MogaNet-XT (Mask R-CNN 1x) |
| Object Detection | COCO 2017 val | AP | 39.7 | MogaNet-XT (RetinaNet 1x) |
| Image Classification | ImageNet | GFLOPs | 102 | MogaNet-XL (384res) |
| Image Classification | ImageNet | GFLOPs | 15.9 | MogaNet-L |
| Image Classification | ImageNet | GFLOPs | 9.9 | MogaNet-B |
| Image Classification | ImageNet | GFLOPs | 5 | MogaNet-S |
| Image Classification | ImageNet | GFLOPs | 1.44 | MogaNet-T (256res) |
| Image Classification | ImageNet | GFLOPs | 1.04 | MogaNet-XT (256res) |
| Video Prediction | Moving MNIST | MAE | 51.84 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 15.67 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9661 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 53.57 | VAN (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 16.21 | VAN (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9646 | VAN (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 55.7 | HorNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 17.4 | HorNet (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9624 | HorNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 55.76 | ConvNeXt (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 17.58 | ConvNeXt (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9617 | ConvNeXt (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 57.52 | Uniformer (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 18.01 | Uniformer (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 59.86 | MLP-Mixer (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 18.85 | MLP-Mixer (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 59.84 | Swin (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 19.11 | Swin (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 61.65 | ViT (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 19.74 | ViT (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9539 | ViT (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 64.31 | Poolformer (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 20.96 | Poolformer (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 67.37 | ConvMixer (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 22.3 | ConvMixer (SimVP 10x) |
| Pose Estimation | COCO val2017 | AP50 | 90.7 | MogaNet-S (256x192) |
| Pose Estimation | COCO val2017 | AP75 | 82.8 | MogaNet-S (256x192) |
| Instance Segmentation | COCO test-dev | mask AP | 48.8 | MogaNet-XL (Cascade Mask R-CNN) |
| Instance Segmentation | COCO test-dev | mask AP | 46.1 | MogaNet-L (Cascade Mask R-CNN) |
| Instance Segmentation | COCO test-dev | mask AP | 46 | MogaNet-B (Cascade Mask R-CNN) |
| Instance Segmentation | COCO test-dev | mask AP | 45.1 | MogaNet-S (Cascade Mask R-CNN) |
| Instance Segmentation | COCO test-dev | mask AP | 44.1 | MogaNet-L (Mask R-CNN 1x) |
| Instance Segmentation | COCO test-dev | mask AP | 43.2 | MogaNet-B (Mask R-CNN 1x) |
| Instance Segmentation | COCO test-dev | mask AP | 42.2 | MogaNet-S (Mask R-CNN 1x) |
| Instance Segmentation | COCO test-dev | mask AP | 39.1 | MogaNet-T (Mask R-CNN 1x) |
| Instance Segmentation | COCO test-dev | mask AP | 37.6 | MogaNet-XT |
| Instance Segmentation | COCO test-dev | mask AP | 35.8 | MogaNet-T |
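
The table above is in long format, with one metric per row, which makes side-by-side comparison of models awkward. As a minimal sketch, the rows can be regrouped per model and used to compute a relative gap between two backbones. The helper names (`parse_rows`, `pivot`, `relative_gap`) are hypothetical, and the sample rows are copied from the Moving MNIST entries in the table.

```python
from collections import defaultdict

SAMPLE = """
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Video Prediction | Moving MNIST | MAE | 51.84 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 15.67 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9661 | MogaNet (SimVP 10x) |
| Video Prediction | Moving MNIST | MAE | 53.57 | VAN (SimVP 10x) |
| Video Prediction | Moving MNIST | MSE | 16.21 | VAN (SimVP 10x) |
| Video Prediction | Moving MNIST | SSIM | 0.9646 | VAN (SimVP 10x) |
"""


def parse_rows(markdown_table):
    """Parse 'Task | Dataset | Metric | Value | Model' pipe rows into tuples."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip malformed lines, the header row, and the |---| separator row.
        if len(cells) != 5 or cells[0] == "Task" or set(cells[0]) <= {"-"}:
            continue
        task, dataset, metric, value, model = cells
        rows.append((task, dataset, metric, float(value), model))
    return rows


def pivot(rows):
    """Group metrics by (task, dataset, model) so each model is one record."""
    table = defaultdict(dict)
    for task, dataset, metric, value, model in rows:
        table[(task, dataset, model)][metric] = value
    return dict(table)


def relative_gap(ours, baseline):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline


records = pivot(parse_rows(SAMPLE))
moga = records[("Video Prediction", "Moving MNIST", "MogaNet (SimVP 10x)")]
van = records[("Video Prediction", "Moving MNIST", "VAN (SimVP 10x)")]
mse_gap = relative_gap(moga["MSE"], van["MSE"])
print(f"MSE reduction vs. VAN: {mse_gap:.1%}")
```

On these rows the MSE reduction of MogaNet over VAN comes out at about 3.3%. Keying the records on task, dataset, and model together keeps results for the same model under different benchmarks separate.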