| 1 | Co-DETR | 56.6 | Yes | DETRs with Collaborative Hybrid Assignments Trai... | 2022-11-22 | Code |
| 2 | ViT-CoMer-L (Mask RCNN, DINOv2) | 55.9 | No | - | - | Code |
| 3 | InternImage-H | 55.4 | Yes | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 4 | EVA | 55 | Yes | EVA: Exploring the Limits of Masked Visual Repre... | 2022-11-14 | Code |
| 5 | Mask Frozen-DETR | 54.9 | Yes | Mask Frozen-DETR: High Quality Instance Segmenta... | 2023-08-07 | - |
| 6 | Mask DINO (SwinL, multi-scale) | 54.5 | Yes | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 7 | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | 54.2 | Yes | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 8 | GLEE-Pro | 54.2 | Yes | General Object Foundation Model for Images and V... | 2023-12-14 | Code |
| 9 | SwinV2-G (HTC++) | 53.7 | Yes | Swin Transformer V2: Scaling Up Capacity and Res... | 2021-11-18 | Code |
| 10 | ViTDet, ViT-H Cascade (multiscale) | 53.1 | No | Exploring Plain Vision Transformer Backbones for... | 2022-03-30 | Code |
| 11 | GLEE-Plus | 53 | Yes | General Object Foundation Model for Images and V... | 2023-12-14 | Code |
| 12 | Mask DINO (SwinL) | 52.6 | No | Mask DINO: Towards A Unified Transformer-based F... | 2022-06-06 | Code |
| 13 | Soft Teacher + Swin-L (HTC++, multi-scale) | 52.5 | Yes | End-to-End Semi-Supervised Object Detection with... | 2021-06-16 | Code |
| 14 | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | 52.5 | No | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 15 | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | 52.2 | No | Vision Transformer Adapter for Dense Predictions | 2022-05-17 | Code |
| 16 | ViTDet, ViT-H Cascade | 52 | No | Exploring Plain Vision Transformer Backbones for... | 2022-03-30 | Code |
| 17 | Soft Teacher + Swin-L (HTC++, single-scale) | 51.9 | Yes | End-to-End Semi-Supervised Object Detection with... | 2021-06-16 | Code |
| 18 | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 51.8 | No | CBNet: A Composite Backbone Network Architecture... | 2021-07-01 | Code |
| 19 | Frozen Backbone, SwinV2-G-ext22K (HTC) | 51.6 | No | Could Giant Pretrained Image Models Extract Univ... | 2022-11-03 | - |
| 20 | CBNetV2 (Dual-Swin-L HTC, multi-scale) | 51 | No | CBNet: A Composite Backbone Network Architecture... | 2021-07-01 | Code |
| 21 | Focal-L (HTC++, multi-scale) | 50.9 | No | Focal Self-attention for Local-Global Interactio... | 2021-07-01 | Code |
| 22 | DiNAT-L (single-scale, Mask2Former) | 50.8 | No | Dilated Neighborhood Attention Transformer | 2022-09-29 | Code |
| 23 | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | 50.5 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 24 | Swin-L (HTC++, multi scale) | 50.4 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 25 | MOAT-3 (IN-22K pretraining, single-scale) | 50.3 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 26 | Mask2Former (Swin-L) | 50.1 | No | Masked-attention Mask Transformer for Universal ... | 2021-12-02 | Code |
| 27 | Swin-L (HTC++, single scale) | 49.5 | No | Swin Transformer: Hierarchical Vision Transforme... | 2021-03-25 | Code |
| 28 | MOAT-2 (IN-22K pretraining, single-scale) | 49.3 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 29 | MOAT-1 (IN-1K pretraining, single-scale) | 49 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 30 | QueryInst (single scale) | 48.9 | No | Instances as Queries | 2021-05-05 | Code |
| 31 | Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale) | 48.9 | Yes | Simple Copy-Paste is a Strong Data Augmentation ... | 2020-12-13 | Code |
| 32 | InternImage-XL | 48.8 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 33 | CenterNet2 (Swin-L w/ X-Paste + Copy-Paste) | 48.8 | No | X-Paste: Revisiting Scalable Copy-Paste for Inst... | 2022-12-07 | Code |
| 34 | Hiera-L | 48.6 | No | Hiera: A Hierarchical Vision Transformer without... | 2023-06-01 | Code |
| 35 | InternImage-L | 48.5 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 36 | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | 48.5 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 37 | GLEE-Lite | 48.4 | Yes | General Object Foundation Model for Images and V... | 2023-12-14 | Code |
| 38 | MOAT-0 (IN-1K pretraining, single-scale) | 47.4 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 39 | MViTv2-L (Cascade Mask R-CNN, single-scale) | 47.1 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 40 | MPViT-B (Cascade Mask R-CNN, multi-scale, IN1k pre-train) | 47 | No | MPViT: Multi-Path Vision Transformer for Dense P... | 2021-12-21 | Code |
| 41 | tiny-MOAT-3 (IN-1K pretraining, single-scale) | 47 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 42 | Cascade Eff-B7 NAS-FPN (1280) | 46.8 | No | Simple Copy-Paste is a Strong Data Augmentation ... | 2020-12-13 | Code |
| 43 | ResNeSt-200 (multi-scale) | 46.25 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 44 | MViT-L (Mask R-CNN, single-scale) | 46.2 | No | MViTv2: Improved Multiscale Vision Transformers ... | 2021-12-02 | Code |
| 45 | RetinaNet (SpineNet-190, 1536x1536) | 46.1 | No | SpineNet: Learning Scale-Permuted Backbone for R... | 2019-12-10 | Code |
| 46 | MPViT-B (Cascade R-CNN, single-scale, IN-1K pre-train) | 45.8 | No | MPViT: Multi-Path Vision Transformer for Dense P... | 2021-12-21 | Code |
| 47 | Mask R-CNN (ViL Base, multi-scale, 3x lr) | 45.7 | No | Multi-Scale Vision Longformer: A New Vision Tran... | 2021-03-29 | Code |
| 48 | Mask R-CNN (ViL Base, 1x lr) | 45.1 | No | Multi-Scale Vision Longformer: A New Vision Tran... | 2021-03-29 | Code |
| 49 | tiny-MOAT-2 (IN-1K pretraining, single-scale) | 45 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 50 | GCNet (ResNeXt-101 + DCN + cascade + GC r4) | 44.7 | No | Global Context Networks | 2020-12-24 | Code |
| 51 | tiny-MOAT-1 (IN-1K pretraining, single-scale) | 44.6 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 52 | InternImage-S | 44.5 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 53 | ResNeSt-200-DCN (single-scale) | 44.5 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 54 | ELSA-S (Cascade Mask RCNN) | 44.4 | No | ELSA: Enhanced Local Self-Attention for Vision T... | 2021-12-23 | Code |
| 55 | BoTNet 200 (Mask R-CNN, single scale, 72 epochs) | 44.4 | No | Bottleneck Transformers for Visual Recognition | 2021-01-27 | Code |
| 56 | DaViT-T (Mask R-CNN, 36 epochs) | 44.3 | No | DaViT: Dual Attention Vision Transformers | 2022-04-07 | Code |
| 57 | ResNeSt-200 (single-scale) | 44.21 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 58 | InternImage-T | 43.7 | No | InternImage: Exploring Large-Scale Vision Founda... | 2022-11-10 | Code |
| 59 | BoTNet 152 (Mask R-CNN, single scale, 72 epochs) | 43.7 | No | Bottleneck Transformers for Visual Recognition | 2021-01-27 | Code |
| 60 | XCiT-M24/8 | 43.7 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 61 | tiny-MOAT-0 (IN-1K pretraining, single-scale) | 43.3 | No | MOAT: Alternating Mobile Convolution and Attenti... | 2022-10-04 | Code |
| 62 | ELSA-S (Mask RCNN) | 43 | No | ELSA: Enhanced Local Self-Attention for Vision T... | 2021-12-23 | Code |
| 63 | XCiT-S24/8 | 43 | No | XCiT: Cross-Covariance Image Transformers | 2021-06-17 | Code |
| 64 | CenterMask-VoVNetV2-99 (multi-scale) | 42.5 | No | CenterMask : Real-Time Anchor-Free Instance Segm... | 2019-11-15 | Code |
| 65 | ResNeSt-101 (single-scale) | 41.56 | No | ResNeSt: Split-Attention Networks | 2020-04-19 | Code |
| 66 | SIW | 41.4 | No | Scaling up Multi-domain Semantic Segmentation wi... | 2022-02-04 | - |
| 67 | Res2Net-101+HTC | 41.3 | No | Res2Net: A New Multi-scale Backbone Architecture | 2019-04-02 | Code |
| 68 | HTC (HRNetV2p-W48) | 41 | No | Deep High-Resolution Representation Learning for... | 2019-08-20 | Code |
| 69 | HTC (HRNetV2p-W48) | 41 | No | Deep High-Resolution Representation Learning for... | 2019-08-20 | Code |
| 70 | GCNet (ResNeXt-101 + DCN + cascade + GC r16) | 40.9 | No | GCNet: Non-local Networks Meet Squeeze-Excitatio... | 2019-04-25 | Code |
| 71 | BoTNet 50 (72 epochs) | 40.7 | No | Bottleneck Transformers for Visual Recognition | 2021-01-27 | Code |
| 72 | R3-CNN (ResNet-50-FPN, DCN) | 40.4 | No | Recursively Refined R-CNN: Instance Segmentation... | 2021-04-03 | Code |
| 73 | Mask R-CNN (ResNeXt-152, +1 NL) | 40.3 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 74 | Mask R-CNN-FPN (AOGNet-40M) | 40.2 | No | Attentive Normalization | 2019-08-04 | Code |
| 75 | R3-CNN (ResNet-50-FPN, GC-Net) | 40.2 | No | Recursively Refined R-CNN: Instance Segmentation... | 2021-04-03 | Code |
| 76 | CenterMask-VoVNetV2-99-3x | 40.2 | No | CenterMask : Real-Time Anchor-Free Instance Segm... | 2019-11-15 | Code |
| 77 | R3-CNN (ResNet-50-FPN, GRoIE) | 39.1 | No | Recursively Refined R-CNN: Instance Segmentation... | 2021-04-03 | Code |
| 78 | Mask Scoring R-CNN (ResNet-101-FPN-DCN) | 39.1 | No | Mask Scoring R-CNN | 2019-03-01 | Code |
| 79 | Mask R-CNN-FPN (ResNeXt-101, GN+WS) | 38.34 | No | Micro-Batch Training with Batch-Channel Normaliz... | 2019-03-25 | Code |
| 80 | R3-CNN (ResNet-50-FPN) | 38.2 | No | Recursively Refined R-CNN: Instance Segmentation... | 2021-04-03 | Code |
| 81 | HTC (ResNet-50) | 38.2 | No | Hybrid Task Cascade for Instance Segmentation | 2019-01-22 | Code |
| 82 | Mask Scoring R-CNN (ResNet-101 FPN) | 38.2 | No | Mask Scoring R-CNN | 2019-03-01 | Code |
| 83 | PANet (ResNet-50) | 37.8 | No | Path Aggregation Network for Instance Segmentation | 2018-03-05 | Code |
| 84 | GCNet (ResNet-50-FPN, GRoIE) | 37.2 | No | A novel Region of Interest Extraction Layer for ... | 2020-04-28 | Code |
| 85 | Mask R-CNN (FPN, X-volution, SA) | 37.2 | No | X-volution: On the unification of convolution an... | 2021-06-04 | - |
| 86 | Mask R-CNN (ResNet-101, +1 NL) | 37.1 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 87 | Mask Scoring R-CNN (ResNet-50 FPN) | 36 | No | Mask Scoring R-CNN | 2019-03-01 | Code |
| 88 | Mask R-CNN (ResNet-50-FPN, GRoIE) | 35.8 | No | A novel Region of Interest Extraction Layer for ... | 2020-04-28 | Code |
| 89 | Faster R-CNN (Res2Net-50) | 35.6 | No | Res2Net: A New Multi-scale Backbone Architecture | 2019-04-02 | Code |
| 90 | Mask R-CNN (ResNet-50, +1 NL) | 35.5 | No | Non-local Neural Networks | 2017-11-21 | Code |
| 91 | Mask R-CNN (ResNet-50, ACNet) | 35.2 | No | Adaptively Connected Neural Networks | 2019-04-07 | Code |
| 92 | YOLACT-550 (ResNet-50) | 29.9 | No | YOLACT: Real-time Instance Segmentation | 2019-04-04 | Code |