16k on COCO minival

Metric: box AP (higher is better)

Results

#	Model	box AP	Augmentations	Paper	Date	Code
1	PE_spatial (DETA)	66	Yes	Perception Encoder: The best visual embeddings a...	2025-04-17	Code
2	Co-DETR	65.9	Yes	DETRs with Collaborative Hybrid Assignments Trai...	2022-11-22	Code
3	M3I Pre-training (InternImage-H)	65	Yes	Towards All-in-one Pre-training via Maximizing M...	2022-11-17	Code
4	InternImage-H	65	Yes	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
5	Co-DETR (Swin-L)	64.7	Yes	DETRs with Collaborative Hybrid Assignments Trai...	2022-11-22	Code
6	Focal-Stable-DINO (Focal-Huge, no TTA)	64.6	Yes	A Strong and Reproducible Object Detector with O...	2023-04-25	Code
7	EVA	64.5	Yes	EVA: Exploring the Limits of Masked Visual Repre...	2022-11-14	Code
8	ViT-CoMer	64.3	No	-	-	Code
9	FocalNet-H (DINO)	64.2	Yes	Focal Modulation Networks	2022-03-22	Code
10	InternImage-XL	64.2	Yes	InternImage: Exploring Large-Scale Vision Founda...	2022-11-10	Code
11	CP-DETR-L Swin-L(Fine tuning separately in COCO)	64.1	Yes	CP-DETR: Concept Prompt Guide DETR Toward Strong...	2024-12-13	-
12	RevCol-H(DINO)	63.8	Yes	Reversible Column Networks	2022-12-22	Code
13	DINO (Swin-L)	63.2	No	DINO: DETR with Improved DeNoising Anchor Boxes ...	2022-03-07	Code
14	Grounding DINO	63	Yes	Grounding DINO: Marrying DINO with Grounded Pre-...	2023-03-09	Code
15	SwinV2-G (HTC++)	62.5	Yes	Swin Transformer V2: Scaling Up Capacity and Res...	2021-11-18	Code
16	Florence-CoSwin-H	62	Yes	Florence: A New Foundation Model for Computer Vi...	2021-11-22	Code
17	GLEE-Pro	62	Yes	General Object Foundation Model for Images and V...	2023-12-14	Code
18	ViTDet, ViT-H Cascade (multiscale)	61.3	No	Exploring Plain Vision Transformer Backbones for...	2022-03-30	Code
19	GLIP (Swin-L, multi-scale)	60.8	Yes	Grounded Language-Image Pre-training	2021-12-07	Code
20	Soft Teacher + Swin-L (HTC++, multi-scale)	60.7	Yes	End-to-End Semi-Supervised Object Detection with...	2021-06-16	Code
21	UNINEXT-H	60.6	Yes	Universal Instance Perception as Object Discover...	2023-03-12	Code
22	ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)	60.5	No	Vision Transformer Adapter for Dense Predictions	2022-05-17	Code
23	ViTDet, ViT-H Cascade	60.4	No	Exploring Plain Vision Transformer Backbones for...	2022-03-30	Code
24	GLEE-Plus	60.4	Yes	General Object Foundation Model for Images and V...	2023-12-14	Code
25	DyHead (Swin-L, multi scale, self-training)	60.3	Yes	Dynamic Head: Unifying Object Detection Heads wi...	2021-06-15	Code
26	ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)	60.2	No	Vision Transformer Adapter for Dense Predictions	2022-05-17	Code
27	Soft Teacher+Swin-L(HTC++, single scale)	60.1	Yes	End-to-End Semi-Supervised Object Detection with...	2021-06-16	Code
28	CBNetV2 (Dual-Swin-L HTC, multi-scale)	59.6	No	CBNet: A Composite Backbone Network Architecture...	2021-07-01	Code
29	Frozen Backbone, SwinV2-G-ext22K (HTC)	59.3	No	Could Giant Pretrained Image Models Extract Univ...	2022-11-03	-
30	HorNet-L	59.2	No	HorNet: Efficient High-Order Spatial Interaction...	2022-07-28	Code
31	MOAT-3 (IN-22K pretraining, single-scale)	59.2	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
32	CBNetV2 (Dual-Swin-L HTC, multi-scale)	59.1	No	CBNet: A Composite Backbone Network Architecture...	2021-07-01	Code
33	Focal-L (DyHead, multi-scale)	58.7	No	Focal Self-attention for Local-Global Interactio...	2021-07-01	Code
34	MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train)	58.7	No	MViTv2: Improved Multiscale Vision Transformers ...	2021-12-02	Code
35	MOAT-2 (IN-22K pretraining, single-scale)	58.5	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
36	DyHead (Swin-L, multi scale)	58.4	No	Dynamic Head: Unifying Object Detection Heads wi...	2021-06-15	Code
37	Swin-L (HTC++, multi scale)	58	No	Swin Transformer: Hierarchical Vision Transforme...	2021-03-25	Code
38	MOAT-1 (IN-1K pretraining, single-scale)	57.7	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
39	UM-MAE(HTC++, Swin-L, IN1K)	57.4	No	Uniform Masking: Enabling MAE Pre-training for P...	2022-05-20	Code
40	YOLOv6-L6(46 fps, 1280, V100)	57.2	No	YOLOv6 v3.0: A Full-Scale Reloading	2023-01-13	Code
41	Swin-L (HTC++, single scale)	57.1	No	Swin Transformer: Hierarchical Vision Transforme...	2021-03-25	Code
42	TransNeXt-Base (IN-1K pretrain, DINO 1x)	57.1	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
43	Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale)	57	Yes	Simple Copy-Paste is a Strong Data Augmentation ...	2020-12-13	Code
44	TransNeXt-Small (IN-1K pretrain, DINO 1x)	56.6	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
45	QueryInst (single scale)	56.1	No	Instances as Queries	2021-05-05	Code
46	MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train)	56.1	No	MViTv2: Improved Multiscale Vision Transformers ...	2021-12-02	Code
47	MOAT-0 (IN-1K pretraining, single-scale)	55.9	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
48	TransNeXt-Tiny (IN-1K pretrain, DINO 1x)	55.7	No	TransNeXt: Robust Foveal Visual Perception for V...	2023-11-28	Code
49	YOLOv4-P7 CSP-P7 (single-scale, 16 fps)	55.4	No	Scaled-YOLOv4: Scaling Cross Stage Partial Network	2020-11-16	Code
50	tiny-MOAT-3 (IN-1K pretraining, single-scale)	55.2	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
51	FAN-L-Hybrid	55.1	No	Understanding The Robustness in Vision Transform...	2022-04-26	Code
52	Hiera-L	55	No	Hiera: A Hierarchical Vision Transformer without...	2023-06-01	Code
53	GLEE-Lite	55	Yes	General Object Foundation Model for Images and V...	2023-12-14	Code
54	TEC(VIT-B, Mask-RCNN)	54.6	No	Towards Sustainable Self-supervised Learning	2022-10-20	Code
55	Cascade Eff-B7 NAS-FPN (1280)	54.5	No	Simple Copy-Paste is a Strong Data Augmentation ...	2020-12-13	Code
56	CAE (ViT-L, Mask R-CNN, 1x schedule)	54.5	No	Context Autoencoder for Self-Supervised Represen...	2022-02-07	Code
57	MViTv2-L (Cascade Mask R-CNN, single-scale)	54.3	No	MViTv2: Improved Multiscale Vision Transformers ...	2021-12-02	Code
58	SpineNet-190 (1280, with Self-training on OpenImages, single-scale)	54.2	Yes	Rethinking Pre-training and Self-training	2020-06-11	Code
59	Cascade RCNN-RS (SpineNet-143L, single scale)	53.6	No	Simple Training Strategies and Model Scaling for...	2021-06-30	Code
60	UniverseNet-20.08d (Res2Net-101, DCN, multi-scale)	53.5	No	USB: Universal-Scale Object Detection Benchmark	2021-03-25	Code
61	MAE (ViT-L, Mask R-CNN)	53.3	No	Masked Autoencoders Are Scalable Vision Learners	2021-11-11	Code
62	Cascade RCNN-RS (ResNet-200, single scale)	53.1	No	Simple Training Strategies and Model Scaling for...	2021-06-30	Code
63	tiny-MOAT-2 (IN-1K pretraining, single-scale)	53	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
64	MViT-L (Mask R-CNN, single-scale, IN21k pre-train)	52.7	No	MViTv2: Improved Multiscale Vision Transformers ...	2021-12-02	Code
65	ResNeSt-200 (multi-scale)	52.47	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
66	ActiveMLP-B (Cascade Mask R-CNN)	52.3	No	Active Token Mixer	2022-03-11	Code
67	RetinaNet (SpineNet-190, 1536x1536)	52.2	No	SpineNet: Learning Scale-Permuted Backbone for R...	2019-12-10	Code
68	EfficientDet-D7 (1536)	52.1	No	EfficientDet: Scalable and Efficient Object Dete...	2019-11-20	Code
69	tiny-MOAT-1 (IN-1K pretraining, single-scale)	51.9	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
70	GCNet (ResNeXt-101 + DCN + cascade + GC r4)	51.8	No	Global Context Networks	2020-12-24	Code
71	ELSA-S (Cascade Mask RCNN)	51.6	No	ELSA: Enhanced Local Self-Attention for Vision T...	2021-12-23	Code
72	FocalNet-T (LRF, Cascade Mask R-CNN)	51.5	No	Focal Modulation Networks	2022-03-22	Code
73	DINO-5scale (24 epoch)	51.3	No	DINO: DETR with Improved DeNoising Anchor Boxes ...	2022-03-07	Code
74	DINO-5scale (36 epoch)	51.2	No	DINO: DETR with Improved DeNoising Anchor Boxes ...	2022-03-07	Code
75	ResNeSt-200-DCN (single-scale)	50.91	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
76	UniverseNet-20.08d (Res2Net-101, DCN, single-scale)	50.9	No	USB: Universal-Scale Object Detection Benchmark	2021-03-25	Code
77	ResNeSt-200 (single-scale)	50.54	No	ResNeSt: Split-Attention Networks	2020-04-19	Code
78	tiny-MOAT-0 (IN-1K pretraining, single-scale)	50.5	No	MOAT: Alternating Mobile Convolution and Attenti...	2022-10-04	Code
79	MAE (ViT-B, Mask R-CNN)	50.3	No	Masked Autoencoders Are Scalable Vision Learners	2021-11-11	Code
80	Sparse R-CNN (PVTv2-B2)	50.1	No	PVT v2: Improved Baselines with Pyramid Vision T...	2021-06-25	Code
81	Pix2seq (ViT-L)	50	Yes	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
82	DaViT-T (Mask R-CNN, 36 epochs)	49.9	No	DaViT: Dual Attention Vision Transformers	2022-04-07	Code
83	BoTNet 200 (Mask R-CNN, single scale, 72 epochs)	49.7	No	Bottleneck Transformers for Visual Recognition	2021-01-27	Code
84	BoTNet 152 (Mask R-CNN, single scale, 72 epochs)	49.5	No	Bottleneck Transformers for Visual Recognition	2021-01-27	Code
85	DN-Deformable-DETR-R50++	49.5	No	DN-DETR: Accelerate DETR Training by Introducing...	2022-03-02	Code
86	REGO-Deformable DETR-X101	49.1	No	Recurrent Glimpse-based Decoder for Detection wi...	2021-12-09	Code
87	CenterMask+VoVNet99 (multi-scale)	48.6	No	CenterMask : Real-Time Anchor-Free Instance Segm...	2019-11-15	Code
88	Mask R-CNN (ResNeXt-152-FPN, cascade)	48.6	No	Rethinking ImageNet Pre-training	2018-11-21	Code
89	UniverseNet-20.08 (Res2Net-50, DCN, single-scale)	48.5	No	USB: Universal-Scale Object Detection Benchmark	2021-03-25	Code
90	XCiT-M24/8	48.5	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
91	ELSA-S (Mask RCNN)	48.3	No	ELSA: Enhanced Local Self-Attention for Vision T...	2021-12-23	Code
92	XCiT-S24/8	48.1	No	XCiT: Cross-Covariance Image Transformers	2021-06-17	Code
93	GCNet (ResNeXt-101 + DCN + cascade + GC r16)	47.9	No	GCNet: Non-local Networks Meet Squeeze-Excitatio...	2019-04-25	Code
94	MAE-Det(MAE-Det-L+GFLV2)	47.8	No	MAE-DET: Revisiting Maximum Entropy Principle in...	2021-11-26	Code
95	Res2Net101+HTC	47.5	No	Res2Net: A New Multi-scale Backbone Architecture	2019-04-02	Code
96	Mask R-CNN (ResNet-101-FPN, GN, Cascade)	47.4	No	Rethinking ImageNet Pre-training	2018-11-21	Code
97	Pix2seq (R50-C4)	47.3	No	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
98	Pix2seq (ViT-B)	47.1	No	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
99	HTC (HRNetV2p-W48)	47	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
100	PatchConvNet-S120 (Mask R-CNN)	47	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
101	RPDet (ResNeXt-101-DCN, multi-scale)	46.8	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
102	DAB-DETR-DC5-R101	46.6	No	DAB-DETR: Dynamic Anchor Boxes are Better Querie...	2022-01-28	Code
103	DyHead (ResNet-101)	46.5	No	Dynamic Head: Unifying Object Detection Heads wi...	2021-06-15	Code
104	Mask R-CNN (ResNeXt-152-FPN)	46.4	No	Rethinking ImageNet Pre-training	2018-11-21	Code
105	RPDet (ResNet-101-DCN, multi-scale)	46.4	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
106	PatchConvNet-S60 (Mask R-CNN)	46.4	No	Augmenting Convolutional networks with attention...	2021-12-27	Code
107	Cascade Mask R-CNN (ResNet-50)	46.3	No	Deep Residual Learning for Image Recognition	2015-12-10	Code
108	HoughNet (HG-104, MS)	46.1	No	HoughNet: Integrating near and long-range eviden...	2020-07-05	Code
109	Mask R-CNN (HRNetV2p-W48, cascade)	46	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
110	Conditional DETR-DC5-R101	45.9	No	Conditional DETR for Fast Training Convergence	2021-08-13	Code
111	BoTNet 50 (72 epochs)	45.9	No	Bottleneck Transformers for Visual Recognition	2021-01-27	Code
112	Sparse R-CNN (ResNet-101, learnable proposals, random crop aug, FPN)	45.6	No	Sparse R-CNN: End-to-End Object Detection with L...	2020-11-25	Code
113	CenterMask+VoVNetV2-99 (single-scale)	45.6	No	CenterMask : Real-Time Anchor-Free Instance Segm...	2019-11-15	Code
114	HTC (HRNetV2p-W32)	45.3	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
115	Anchor DETR-DC5-R101	45.1	No	Anchor DETR: Query Design for Transformer-Based ...	2021-09-15	Code
116	Conditional DETR-DC5-R50	45.1	No	Conditional DETR for Fast Training Convergence	2021-08-13	Code
117	Mask R-CNN (ResNeXt-152 + 1 NL)	45	No	Non-local Neural Networks	2017-11-21	Code
118	Pix2seq (R101-DC5)	45	No	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
119	Mask R-CNN-FPN (AOGNet-40M)	44.9	No	Attentive Normalization	2019-08-04	Code
120	DETR-DC5 (ResNet-101)	44.9	No	End-to-End Object Detection with Transformers	2020-05-26	Code
121	Mask R-CNN (VoVNetV2-99, single-scale)	44.9	No	CenterMask : Real-Time Anchor-Free Instance Segm...	2019-11-15	Code
122	R3-CNN (ResNet-50-FPN, DCN)	44.8	No	Recursively Refined R-CNN: Instance Segmentation...	2021-04-03	Code
123	RPDet (ResNet-101-DCN, multi-scale train)	44.8	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
124	RetinaNet (ViL-Base, multi-scale, 3x)	44.7	No	Multi-Scale Vision Longformer: A New Vision Tran...	2021-03-29	Code
125	Cascade R-CNN (HRNetV2p-W48)	44.6	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
126	CenterMask+VoVNetV2-57 (single-scale)	44.6	No	CenterMask : Real-Time Anchor-Free Instance Segm...	2019-11-15	Code
127	Conditional DETR-R101	44.5	No	Conditional DETR for Fast Training Convergence	2021-08-13	Code
128	Sparse R-CNN (ResNet-50, learnable proposals, random crop aug, FPN)	44.5	No	Sparse R-CNN: End-to-End Object Detection with L...	2020-11-25	Code
129	GFL (ResNet-50)	44.5	No	Deep Residual Learning for Image Recognition	2015-12-10	Code
130	RPDet (ResNeXt-101-DCN)	44.5	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
131	CenterMask+X101-32x8d (single-scale)	44.4	No	CenterMask : Real-Time Anchor-Free Instance Segm...	2019-11-15	Code
132	RetinaNet (ViL-Base)	44.3	No	Multi-Scale Vision Longformer: A New Vision Tran...	2021-03-29	Code
133	R3-CNN (ResNet-50-FPN, GC-Net)	44.3	No	Recursively Refined R-CNN: Instance Segmentation...	2021-04-03	Code
134	Anchor DETR-DC5-R50	44.2	No	Anchor DETR: Query Design for Transformer-Based ...	2021-09-15	Code
135	DAB-DETR-R101	44.1	No	DAB-DETR: Dynamic Anchor Boxes are Better Querie...	2022-01-28	Code
136	Faster RCNN-R101-FPN+	44	No	End-to-End Object Detection with Transformers	2020-05-26	Code
137	Cascade R-CNN (HRNetV2p-W32)	43.7	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
138	Sparse R-CNN (ResNet-101, FPN)	43.5	No	Sparse R-CNN: End-to-End Object Detection with L...	2020-11-25	Code
139	ATSS (ResNet-50)	43.5	No	Deep Residual Learning for Image Recognition	2015-12-10	Code
140	PVT-Large (RetinaNet 3x,MS)	43.4	No	Pyramid Vision Transformer: A Versatile Backbone...	2021-02-24	Code
141	ExtremeNet (Hourglass-104, multi-scale)	43.3	No	Bottom-up Object Detection by Grouping Extreme a...	2019-01-23	Code
142	Pix2seq (R50-DC5 )	43.2	No	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
143	HTC (cascade)	43.2	No	Hybrid Task Cascade for Instance Segmentation	2019-01-22	Code
144	Mask R-CNN-FPN (ResNeXt-101, GN+WS)	43.12	No	Micro-Batch Training with Batch-Channel Normaliz...	2019-03-25	Code
145	HTC (HRNetV2p-W18)	43.1	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
146	Mask R-CNN (ResNet-101, DCNv2)	43.1	No	Deformable ConvNets v2: More Deformable, Better ...	2018-11-27	Code
147	Conditional DETR-R50	43	No	Conditional DETR for Fast Training Convergence	2021-08-13	Code
148	HoughNet (HG-104)	43	No	HoughNet: Integrating near and long-range eviden...	2020-07-05	Code
149	Faster R-CNN (FPN, X-volution)	42.8	No	X-volution: On the unification of convolution an...	2021-06-04	-
150	Cascade R-CNN (ResNet-101-FPN+, cascade)	42.7	No	Cascade R-CNN: Delving into High Quality Object ...	2017-12-03	Code
151	PVT-Large (RetinaNet 1x)	42.6	No	Pyramid Vision Transformer: A Versatile Backbone...	2021-02-24	Code
152	CornerNet-Saccade (Hourglass-54)	42.6	No	CornerNet-Lite: Efficient Keypoint Based Object ...	2019-04-18	Code
153	Pix2seq (R50)	42.6	No	Pix2seq: A Language Modeling Framework for Objec...	2021-09-22	Code
154	Mask R-CNN (ResNet-101-FPN, GroupNorm, long)	42.3	No	Group Normalization	2018-03-22	Code
155	Sparse R-CNN (ResNet-50, FPN)	42.3	No	Sparse R-CNN: End-to-End Object Detection with L...	2020-11-25	Code
156	Mask R-CNN (HRNetV2p-W32)	42.3	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
157	DETR-ResNet50 with iRPE-K (300 epochs)	42.3	No	Rethinking and Improving Relative Position Encod...	2021-07-29	Code
158	TridentNet (ResNet-101)	42	No	Scale-Aware Trident Networks for Object Detection	2019-01-07	Code
159	R3-CNN (ResNet-50-FPN)	42	No	Recursively Refined R-CNN: Instance Segmentation...	2021-04-03	Code
160	Faster R-CNN (HRNetV2p-W48)	41.8	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
161	Faster R-CNN (LIP-ResNet-101)	41.7	No	LIP: Local Importance-based Pooling	2019-08-12	Code
162	Faster R-CNN (ResNet-101, DCNv2)	41.7	No	Deformable ConvNets v2: More Deformable, Better ...	2018-11-27	Code
163	FSAF (ResNeXt-101, anchor-based branches)	41.6	No	Feature Selective Anchor-Free Module for Single-...	2019-03-02	Code
164	CornerNet-Saccade (Hourglass-104)	41.4	No	CornerNet-Lite: Efficient Keypoint Based Object ...	2019-04-18	Code
165	Grid R-CNN (ResNet-101-FPN)	41.3	No	Grid R-CNN	2018-11-29	Code
166	Cascade R-CNN (HRNetV2p-W18)	41.3	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
167	CenterNet511 (Hourglass-52)	41.3	No	CenterNet: Keypoint Triplets for Object Detection	2019-04-17	Code
168	RetinaMask (ResNet-101-FPN)	41.1	No	RetinaMask: Learning to predict masks improves s...	2019-01-10	Code
169	PoolFormer-S36 (Mask R-CNN)	41	No	MetaFormer Is Actually What You Need for Vision	2021-11-22	Code
170	Faster R-CNN (HRNetV2p-W32)	40.9	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
171	VirTex Mask R-CNN (ResNet-50-FPN)	40.9	No	VirTex: Learning Visual Representations from Tex...	2020-06-11	Code
172	Mask R-CNN (ResNet-101 + 1 NL)	40.8	No	Non-local Neural Networks	2017-11-21	Code
173	Mask R-CNN (ResNet-50-FPN, GroupNorm, long)	40.8	No	Group Normalization	2018-03-22	Code
174	RPDet (ResNet-50, multi-scale train)	40.8	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
175	DETR-ResNet50 with iRPE-K (150 epochs)	40.8	No	Rethinking and Improving Relative Position Encod...	2021-07-29	Code
176	Faster R-CNN+aLRP Loss (ResNet-50, 500 scale)	40.7	No	A Ranking-based, Balanced Loss Function Unifying...	2020-09-28	Code
177	PPDet (ResNet-101-FPN)	40.5	No	Reducing Label Noise in Anchor-Free Object Detec...	2020-08-03	Code
178	GCnet (ResNet-50-FPN, GRoIE)	40.3	No	GCNet: Non-local Networks Meet Squeeze-Excitatio...	2019-04-25	Code
179	Mask R-CNN (ResNet-50-FPN, GroupNorm)	40.3	No	Group Normalization	2018-03-22	Code
180	Cascade R-CNN (ResNet-50-FPN+)	40.3	No	Cascade R-CNN: Delving into High Quality Object ...	2017-12-03	Code
181	ExtremeNet (Hourglass-104, single-scale)	40.3	No	Bottom-up Object Detection by Grouping Extreme a...	2019-01-23	Code
182	RPDet (ResNet-101)	40.3	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
183	RetinaNet+aLRP Loss (ResNet-50, 500 scale)	40.2	No	A Ranking-based, Balanced Loss Function Unifying...	2020-09-28	Code
184	Mask R-CNN (ResNet-101-FPN)	40	No	Mask R-CNN	2017-03-20	Code
185	FPN+	39.8	No	Feature Pyramid Networks for Object Detection	2016-12-09	Code
186	FoveaBox+aLRP Loss (ResNet-50, 500 scale)	39.7	No	A Ranking-based, Balanced Loss Function Unifying...	2020-09-28	Code
187	Grid R-CNN (ResNet-50-FPN)	39.6	No	Grid R-CNN	2018-11-29	Code
188	Mask R-CNN (ResNet-50, ACNet)	39.5	No	Adaptively Connected Neural Networks	2019-04-07	Code
189	FSAF (ResNet-101, anchor-based branches)	39.3	No	Feature Selective Anchor-Free Module for Single-...	2019-03-02	Code
190	Mask R-CNN (HRNetV2p-W18)	39.2	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
191	Mask R-CNN (ResNet-50 + 1 NL)	39	No	Non-local Neural Networks	2017-11-21	Code
192	FoveaBox (ResNet-101-FPN, 800x800)	38.9	No	FoveaBox: Beyond Anchor-based Object Detector	2019-04-08	Code
193	FCOS (ResNet-50-FPN + improvements)	38.6	No	FCOS: Fully Convolutional One-Stage Object Detec...	2019-04-02	Code
194	RPDet (ResNet-50)	38.6	No	RepPoints: Point Set Representation for Object D...	2019-04-25	Code
195	Libra R-CNN (ResNet-50 FPN)	38.5	No	Libra R-CNN: Towards Balanced Learning for Objec...	2019-04-04	Code
196	Mask R-CNN (ResNet-50-FPN, GRoIE)	38.4	No	A novel Region of Interest Extraction Layer for ...	2020-04-28	Code
197	CornerNet511 (Hourglass-104)	38.4	No	CornerNet: Detecting Objects as Paired Keypoints	2018-08-03	Code
198	FoveaBox+Retina (ResNet-50)	38.1	No	FoveaBox: Beyond Anchor-based Object Detector	2019-04-08	Code
199	Faster R-CNN (HRNetV2p-W18)	38	No	Deep High-Resolution Representation Learning for...	2019-08-20	Code
200	FoveaBox (ResNet-101-FPN, 600x600)	38	No	FoveaBox: Beyond Anchor-based Object Detector	2019-04-08	Code
201	FSAF (ResNet-101)	37.9	No	Feature Selective Anchor-Free Module for Single-...	2019-03-02	Code
202	Mask R-CNN (ResNet-50-FPN)	37.7	No	Mask R-CNN	2017-03-20	Code
203	Faster R-CNN (ResNet-50-FPN, GRoIE)	37.5	No	A novel Region of Interest Extraction Layer for ...	2020-04-28	Code
204	Mask R-CNN (ResNeXt-101-FPN)	36.7	No	Mask R-CNN	2017-03-20	Code
205	FoveaBox (ResNet-50-FPN, 600x600)	36	No	FoveaBox: Beyond Anchor-based Object Detector	2019-04-08	Code
206	FSAF (ResNet-50)	35.9	No	Feature Selective Anchor-Free Module for Single-...	2019-03-02	Code
207	GHM-C + GHM-R (RetinaNet-FPN-ResNet-50, M=30)	35.8	No	Gradient Harmonized Single-stage Detector	2018-11-13	Code
208	Online Fg Bal. Sampling+Hard Negative Mining (ResNet-50)	35.6	No	Generating Positive Bounding Boxes for Balanced ...	2019-09-21	Code
209	M2Det (ResNet-101, 320x320)	34.1	No	M2Det: A Single-Shot Object Detector based on Mu...	2018-11-12	Code
210	Faster R-CNN (Res2Net-50)	33.7	No	Res2Net: A New Multi-scale Backbone Architecture	2019-04-02	Code
211	M2Det (VGG-16, 320x320)	33.2	No	M2Det: A Single-Shot Object Detector based on Mu...	2018-11-12	Code

#1PE_spatial (DETA)SOTA
66
box AP· Augmentations· 2025-04-17
Perception Encoder: The best visual embeddings are not at the output of the network Code
#2Co-DETRSOTA
65.9
box AP· Augmentations· 2022-11-22
DETRs with Collaborative Hybrid Assignments Training Code
#3M3I Pre-training (InternImage-H)
65
box AP· Augmentations· 2022-11-17
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information Code
#4InternImage-HSOTA
65
box AP· Augmentations· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#5Co-DETR (Swin-L)
64.7
box AP· Augmentations· 2022-11-22
DETRs with Collaborative Hybrid Assignments Training Code
#6Focal-Stable-DINO (Focal-Huge, no TTA)
64.6
box AP· Augmentations· 2023-04-25
A Strong and Reproducible Object Detector with Only Public Datasets Code
#7EVA
64.5
box AP· Augmentations· 2022-11-14
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale Code
#8ViT-CoMer
64.3
box AP
No paper Code
#9FocalNet-H (DINO)SOTA
64.2
box AP· Augmentations· 2022-03-22
Focal Modulation Networks Code
#10InternImage-XL
64.2
box AP· Augmentations· 2022-11-10
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions Code
#11CP-DETR-L Swin-L(Fine tuning separately in COCO)
64.1
box AP· Augmentations· 2024-12-13
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection
#12RevCol-H(DINO)
63.8
box AP· Augmentations· 2022-12-22
Reversible Column Networks Code
#13DINO (Swin-L)SOTA
63.2
box AP· 2022-03-07
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection Code
#14Grounding DINO
63
box AP· Augmentations· 2023-03-09
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection Code
#15SwinV2-G (HTC++)SOTA
62.5
box AP· Augmentations· 2021-11-18
Swin Transformer V2: Scaling Up Capacity and Resolution Code
#16Florence-CoSwin-H
62
box AP· Augmentations· 2021-11-22
Florence: A New Foundation Model for Computer Vision Code
#17GLEE-Pro
62
box AP· Augmentations· 2023-12-14
General Object Foundation Model for Images and Videos at Scale Code
#18ViTDet, ViT-H Cascade (multiscale)
61.3
box AP· 2022-03-30
Exploring Plain Vision Transformer Backbones for Object Detection Code
#19GLIP (Swin-L, multi-scale)
60.8
box AP· Augmentations· 2021-12-07
Grounded Language-Image Pre-training Code
#20Soft Teacher + Swin-L (HTC++, multi-scale)SOTA
60.7
box AP· Augmentations· 2021-06-16
End-to-End Semi-Supervised Object Detection with Soft Teacher Code
#21UNINEXT-H
60.6
box AP· Augmentations· 2023-03-12
Universal Instance Perception as Object Discovery and Retrieval Code
#22ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale)
60.5
box AP· 2022-05-17
Vision Transformer Adapter for Dense Predictions Code
#23ViTDet, ViT-H Cascade
60.4
box AP· 2022-03-30
Exploring Plain Vision Transformer Backbones for Object Detection Code
#24GLEE-Plus
60.4
box AP· Augmentations· 2023-12-14
General Object Foundation Model for Images and Videos at Scale Code
#25DyHead (Swin-L, multi scale, self-training)SOTA
60.3
box AP· Augmentations· 2021-06-15
Dynamic Head: Unifying Object Detection Heads with Attentions Code
#26ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale)
60.2
box AP· 2022-05-17
Vision Transformer Adapter for Dense Predictions Code
#27Soft Teacher+Swin-L(HTC++, single scale)
60.1
box AP· Augmentations· 2021-06-16
End-to-End Semi-Supervised Object Detection with Soft Teacher Code
#28CBNetV2 (Dual-Swin-L HTC, multi-scale)
59.6
box AP· 2021-07-01
CBNet: A Composite Backbone Network Architecture for Object Detection Code
#29Frozen Backbone, SwinV2-G-ext22K (HTC)
59.3
box AP· 2022-11-03
Could Giant Pretrained Image Models Extract Universal Representations?
#30HorNet-L
59.2
box AP· 2022-07-28
HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions Code
#31MOAT-3 (IN-22K pretraining, single-scale)
59.2
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#32CBNetV2 (Dual-Swin-L HTC, multi-scale)
59.1
box AP· 2021-07-01
CBNet: A Composite Backbone Network Architecture for Object Detection Code
#33Focal-L (DyHead, multi-scale)
58.7
box AP· 2021-07-01
Focal Self-attention for Local-Global Interactions in Vision Transformers Code
#34MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train)
58.7
box AP· 2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
#35MOAT-2 (IN-22K pretraining, single-scale)
58.5
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#36DyHead (Swin-L, multi scale)
58.4
box AP· 2021-06-15
Dynamic Head: Unifying Object Detection Heads with Attentions Code
#37Swin-L (HTC++, multi scale)SOTA
58
box AP· 2021-03-25
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Code
#38MOAT-1 (IN-1K pretraining, single-scale)
57.7
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#39UM-MAE(HTC++, Swin-L, IN1K)
57.4
box AP· 2022-05-20
Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality Code
#40YOLOv6-L6(46 fps, 1280, V100)
57.2
box AP· 2023-01-13
YOLOv6 v3.0: A Full-Scale Reloading Code
#41Swin-L (HTC++, single scale)
57.1
box AP· 2021-03-25
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Code
#42TransNeXt-Base (IN-1K pretrain, DINO 1x)
57.1
box AP· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#43Cascade Eff-B7 NAS-FPN (1280, self-training Copy Paste, single-scale)SOTA
57
box AP· Augmentations· 2020-12-13
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation Code
#44TransNeXt-Small (IN-1K pretrain, DINO 1x)
56.6
box AP· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#45QueryInst (single scale)
56.1
box AP· 2021-05-05
Instances as Queries Code
#46MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train)
56.1
box AP· 2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
#47MOAT-0 (IN-1K pretraining, single-scale)
55.9
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#48TransNeXt-Tiny (IN-1K pretrain, DINO 1x)
55.7
box AP· 2023-11-28
TransNeXt: Robust Foveal Visual Perception for Vision Transformers Code
#49YOLOv4-P7 CSP-P7 (single-scale, 16 fps)SOTA
55.4
box AP· 2020-11-16
Scaled-YOLOv4: Scaling Cross Stage Partial Network Code
#50tiny-MOAT-3 (IN-1K pretraining, single-scale)
55.2
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#51FAN-L-Hybrid
55.1
box AP· 2022-04-26
Understanding The Robustness in Vision Transformers Code
#52Hiera-L
55
box AP· 2023-06-01
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles Code
#53GLEE-Lite
55
box AP· Augmentations· 2023-12-14
General Object Foundation Model for Images and Videos at Scale Code
#54TEC(VIT-B, Mask-RCNN)
54.6
box AP· 2022-10-20
Towards Sustainable Self-supervised Learning Code
#55Cascade Eff-B7 NAS-FPN (1280)
54.5
box AP· 2020-12-13
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation Code
#56CAE (ViT-L, Mask R-CNN, 1x schedule)
54.5
box AP· 2022-02-07
Context Autoencoder for Self-Supervised Representation Learning Code
#57MViTv2-L (Cascade Mask R-CNN, single-scale)
54.3
box AP· 2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
#58SpineNet-190 (1280, with Self-training on OpenImages, single-scale)SOTA
54.2
box AP· Augmentations· 2020-06-11
Rethinking Pre-training and Self-training Code
#59Cascade RCNN-RS (SpineNet-143L, single scale)
53.6
box AP· 2021-06-30
Simple Training Strategies and Model Scaling for Object Detection Code
#60UniverseNet-20.08d (Res2Net-101, DCN, multi-scale)
53.5
box AP· 2021-03-25
USB: Universal-Scale Object Detection Benchmark Code
#61MAE (ViT-L, Mask R-CNN)
53.3
box AP· 2021-11-11
Masked Autoencoders Are Scalable Vision Learners Code
#62Cascade RCNN-RS (ResNet-200, single scale)
53.1
box AP· 2021-06-30
Simple Training Strategies and Model Scaling for Object Detection Code
#63tiny-MOAT-2 (IN-1K pretraining, single-scale)
53
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#64MViT-L (Mask R-CNN, single-scale, IN21k pre-train)
52.7
box AP· 2021-12-02
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Code
#65ResNeSt-200 (multi-scale)SOTA
52.47
box AP· 2020-04-19
ResNeSt: Split-Attention Networks Code
#66ActiveMLP-B (Cascade Mask R-CNN)
52.3
box AP· 2022-03-11
Active Token Mixer Code
#67RetinaNet (SpineNet-190, 1536x1536)SOTA
52.2
box AP· 2019-12-10
SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization Code
#68EfficientDet-D7 (1536)SOTA
52.1
box AP· 2019-11-20
EfficientDet: Scalable and Efficient Object Detection Code
#69tiny-MOAT-1 (IN-1K pretraining, single-scale)
51.9
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#70GCNet (ResNeXt-101 + DCN + cascade + GC r4)
51.8
box AP· 2020-12-24
Global Context Networks Code
#71ELSA-S (Cascade Mask RCNN)
51.6
box AP· 2021-12-23
ELSA: Enhanced Local Self-Attention for Vision Transformer Code
#72FocalNet-T (LRF, Cascade Mask R-CNN)
51.5
box AP· 2022-03-22
Focal Modulation Networks Code
#73DINO-5scale (24 epoch)
51.3
box AP· 2022-03-07
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection Code
#74DINO-5scale (36 epoch)
51.2
box AP· 2022-03-07
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection Code
#75ResNeSt-200-DCN (single-scale)
50.91
box AP· 2020-04-19
ResNeSt: Split-Attention Networks Code
#76UniverseNet-20.08d (Res2Net-101, DCN, single-scale)
50.9
box AP· 2021-03-25
USB: Universal-Scale Object Detection Benchmark Code
#77ResNeSt-200 (single-scale)
50.54
box AP· 2020-04-19
ResNeSt: Split-Attention Networks Code
#78tiny-MOAT-0 (IN-1K pretraining, single-scale)
50.5
box AP· 2022-10-04
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models Code
#79MAE (ViT-B, Mask R-CNN)
50.3
box AP· 2021-11-11
Masked Autoencoders Are Scalable Vision Learners Code
#80Sparse R-CNN (PVTv2-B2)
50.1
box AP· 2021-06-25
PVT v2: Improved Baselines with Pyramid Vision Transformer Code
#81Pix2seq (ViT-L)
50
box AP· Augmentations· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#82DaViT-T (Mask R-CNN, 36 epochs)
49.9
box AP· 2022-04-07
DaViT: Dual Attention Vision Transformers Code
#83BoTNet 200 (Mask R-CNN, single scale, 72 epochs)
49.7
box AP· 2021-01-27
Bottleneck Transformers for Visual Recognition Code
#84BoTNet 152 (Mask R-CNN, single scale, 72 epochs)
49.5
box AP· 2021-01-27
Bottleneck Transformers for Visual Recognition Code
#85DN-Deformable-DETR-R50++
49.5
box AP· 2022-03-02
DN-DETR: Accelerate DETR Training by Introducing Query DeNoising Code
#86REGO-Deformable DETR-X101
49.1
box AP· 2021-12-09
Recurrent Glimpse-based Decoder for Detection with Transformer Code
#87CenterMask+VoVNet99 (multi-scale)
48.6
box AP· 2019-11-15
CenterMask : Real-Time Anchor-Free Instance Segmentation Code
#88Mask R-CNN (ResNeXt-152-FPN, cascade) SOTA
48.6
box AP· 2018-11-21
Rethinking ImageNet Pre-training Code
#89UniverseNet-20.08 (Res2Net-50, DCN, single-scale)
48.5
box AP· 2021-03-25
USB: Universal-Scale Object Detection Benchmark Code
#90XCiT-M24/8
48.5
box AP· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#91ELSA-S (Mask RCNN)
48.3
box AP· 2021-12-23
ELSA: Enhanced Local Self-Attention for Vision Transformer Code
#92XCiT-S24/8
48.1
box AP· 2021-06-17
XCiT: Cross-Covariance Image Transformers Code
#93GCNet (ResNeXt-101 + DCN + cascade + GC r16)
47.9
box AP· 2019-04-25
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond Code
#94MAE-Det (MAE-Det-L+GFLV2)
47.8
box AP· 2021-11-26
MAE-DET: Revisiting Maximum Entropy Principle in Zero-Shot NAS for Efficient Object Detection Code
#95Res2Net101+HTC
47.5
box AP· 2019-04-02
Res2Net: A New Multi-scale Backbone Architecture Code
#96Mask R-CNN (ResNet-101-FPN, GN, Cascade)
47.4
box AP· 2018-11-21
Rethinking ImageNet Pre-training Code
#97Pix2seq (R50-C4)
47.3
box AP· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#98Pix2seq (ViT-B)
47.1
box AP· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#99HTC (HRNetV2p-W48)
47
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#100PatchConvNet-S120 (Mask R-CNN)
47
box AP· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#101RPDet (ResNeXt-101-DCN, multi-scale)
46.8
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#102DAB-DETR-DC5-R101
46.6
box AP· 2022-01-28
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR Code
#103DyHead (ResNet-101)
46.5
box AP· 2021-06-15
Dynamic Head: Unifying Object Detection Heads with Attentions Code
#104Mask R-CNN (ResNeXt-152-FPN)
46.4
box AP· 2018-11-21
Rethinking ImageNet Pre-training Code
#105RPDet (ResNet-101-DCN, multi-scale)
46.4
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#106PatchConvNet-S60 (Mask R-CNN)
46.4
box AP· 2021-12-27
Augmenting Convolutional networks with attention-based aggregation Code
#107Cascade Mask R-CNN (ResNet-50) SOTA
46.3
box AP· 2015-12-10
Deep Residual Learning for Image Recognition Code
#108HoughNet (HG-104, MS)
46.1
box AP· 2020-07-05
HoughNet: Integrating near and long-range evidence for bottom-up object detection Code
#109Mask R-CNN (HRNetV2p-W48, cascade)
46
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#110Conditional DETR-DC5-R101
45.9
box AP· 2021-08-13
Conditional DETR for Fast Training Convergence Code
#111BoTNet 50 (72 epochs)
45.9
box AP· 2021-01-27
Bottleneck Transformers for Visual Recognition Code
#112Sparse R-CNN (ResNet-101, learnable proposals, random crop aug, FPN)
45.6
box AP· 2020-11-25
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals Code
#113CenterMask+VoVNetV2-99 (single-scale)
45.6
box AP· 2019-11-15
CenterMask : Real-Time Anchor-Free Instance Segmentation Code
#114HTC (HRNetV2p-W32)
45.3
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#115Anchor DETR-DC5-R101
45.1
box AP· 2021-09-15
Anchor DETR: Query Design for Transformer-Based Object Detection Code
#116Conditional DETR-DC5-R50
45.1
box AP· 2021-08-13
Conditional DETR for Fast Training Convergence Code
#117Mask R-CNN (ResNeXt-152 + 1 NL)
45
box AP· 2017-11-21
Non-local Neural Networks Code
#118Pix2seq (R101-DC5)
45
box AP· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#119Mask R-CNN-FPN (AOGNet-40M)
44.9
box AP· 2019-08-04
Attentive Normalization Code
#120DETR-DC5 (ResNet-101)
44.9
box AP· 2020-05-26
End-to-End Object Detection with Transformers Code
#121Mask R-CNN (VoVNetV2-99, single-scale)
44.9
box AP· 2019-11-15
CenterMask : Real-Time Anchor-Free Instance Segmentation Code
#122R3-CNN (ResNet-50-FPN, DCN)
44.8
box AP· 2021-04-03
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing Code
#123RPDet (ResNet-101-DCN, multi-scale train)
44.8
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#124RetinaNet (ViL-Base, multi-scale, 3x)
44.7
box AP· 2021-03-29
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding Code
#125Cascade R-CNN (HRNetV2p-W48)
44.6
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#126CenterMask+VoVNetV2-57 (single-scale)
44.6
box AP· 2019-11-15
CenterMask : Real-Time Anchor-Free Instance Segmentation Code
#127Conditional DETR-R101
44.5
box AP· 2021-08-13
Conditional DETR for Fast Training Convergence Code
#128Sparse R-CNN (ResNet-50, learnable proposals, random crop aug, FPN)
44.5
box AP· 2020-11-25
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals Code
#129GFL (ResNet-50)
44.5
box AP· 2015-12-10
Deep Residual Learning for Image Recognition Code
#130RPDet (ResNeXt-101-DCN)
44.5
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#131CenterMask+X101-32x8d (single-scale)
44.4
box AP· 2019-11-15
CenterMask : Real-Time Anchor-Free Instance Segmentation Code
#132RetinaNet (ViL-Base)
44.3
box AP· 2021-03-29
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding Code
#133R3-CNN (ResNet-50-FPN, GC-Net)
44.3
box AP· 2021-04-03
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing Code
#134Anchor DETR-DC5-R50
44.2
box AP· 2021-09-15
Anchor DETR: Query Design for Transformer-Based Object Detection Code
#135DAB-DETR-R101
44.1
box AP· 2022-01-28
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR Code
#136Faster RCNN-R101-FPN+
44
box AP· 2020-05-26
End-to-End Object Detection with Transformers Code
#137Cascade R-CNN (HRNetV2p-W32)
43.7
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#138Sparse R-CNN (ResNet-101, FPN)
43.5
box AP· 2020-11-25
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals Code
#139ATSS (ResNet-50)
43.5
box AP· 2015-12-10
Deep Residual Learning for Image Recognition Code
#140PVT-Large (RetinaNet 3x,MS)
43.4
box AP· 2021-02-24
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions Code
#141ExtremeNet (Hourglass-104, multi-scale)
43.3
box AP· 2019-01-23
Bottom-up Object Detection by Grouping Extreme and Center Points Code
#142Pix2seq (R50-DC5)
43.2
box AP· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#143HTC (cascade)
43.2
box AP· 2019-01-22
Hybrid Task Cascade for Instance Segmentation Code
#144Mask R-CNN-FPN (ResNeXt-101, GN+WS)
43.12
box AP· 2019-03-25
Micro-Batch Training with Batch-Channel Normalization and Weight Standardization Code
#145HTC (HRNetV2p-W18)
43.1
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#146Mask R-CNN (ResNet-101, DCNv2)
43.1
box AP· 2018-11-27
Deformable ConvNets v2: More Deformable, Better Results Code
#147Conditional DETR-R50
43
box AP· 2021-08-13
Conditional DETR for Fast Training Convergence Code
#148HoughNet (HG-104)
43
box AP· 2020-07-05
HoughNet: Integrating near and long-range evidence for bottom-up object detection Code
#149Faster R-CNN (FPN, X-volution)
42.8
box AP· 2021-06-04
X-volution: On the unification of convolution and self-attention
#150Cascade R-CNN (ResNet-101-FPN+, cascade)
42.7
box AP· 2017-12-03
Cascade R-CNN: Delving into High Quality Object Detection Code
#151PVT-Large (RetinaNet 1x)
42.6
box AP· 2021-02-24
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions Code
#152CornerNet-Saccade (Hourglass-54)
42.6
box AP· 2019-04-18
CornerNet-Lite: Efficient Keypoint Based Object Detection Code
#153Pix2seq (R50)
42.6
box AP· 2021-09-22
Pix2seq: A Language Modeling Framework for Object Detection Code
#154Mask R-CNN (ResNet-101-FPN, GroupNorm, long)
42.3
box AP· 2018-03-22
Group Normalization Code
#155Sparse R-CNN (ResNet-50, FPN)
42.3
box AP· 2020-11-25
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals Code
#156Mask R-CNN (HRNetV2p-W32)
42.3
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#157DETR-ResNet50 with iRPE-K (300 epochs)
42.3
box AP· 2021-07-29
Rethinking and Improving Relative Position Encoding for Vision Transformer Code
#158TridentNet (ResNet-101)
42
box AP· 2019-01-07
Scale-Aware Trident Networks for Object Detection Code
#159R3-CNN (ResNet-50-FPN)
42
box AP· 2021-04-03
Recursively Refined R-CNN: Instance Segmentation with Self-RoI Rebalancing Code
#160Faster R-CNN (HRNetV2p-W48)
41.8
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#161Faster R-CNN (LIP-ResNet-101)
41.7
box AP· 2019-08-12
LIP: Local Importance-based Pooling Code
#162Faster R-CNN (ResNet-101, DCNv2)
41.7
box AP· 2018-11-27
Deformable ConvNets v2: More Deformable, Better Results Code
#163FSAF (ResNeXt-101, anchor-based branches)
41.6
box AP· 2019-03-02
Feature Selective Anchor-Free Module for Single-Shot Object Detection Code
#164CornerNet-Saccade (Hourglass-104)
41.4
box AP· 2019-04-18
CornerNet-Lite: Efficient Keypoint Based Object Detection Code
#165Grid R-CNN (ResNet-101-FPN)
41.3
box AP· 2018-11-29
Grid R-CNN Code
#166Cascade R-CNN (HRNetV2p-W18)
41.3
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#167CenterNet511 (Hourglass-52)
41.3
box AP· 2019-04-17
CenterNet: Keypoint Triplets for Object Detection Code
#168RetinaMask (ResNet-101-FPN)
41.1
box AP· 2019-01-10
RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free Code
#169PoolFormer-S36 (Mask R-CNN)
41
box AP· 2021-11-22
MetaFormer Is Actually What You Need for Vision Code
#170Faster R-CNN (HRNetV2p-W32)
40.9
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#171VirTex Mask R-CNN (ResNet-50-FPN)
40.9
box AP· 2020-06-11
VirTex: Learning Visual Representations from Textual Annotations Code
#172Mask R-CNN (ResNet-101 + 1 NL)
40.8
box AP· 2017-11-21
Non-local Neural Networks Code
#173Mask R-CNN (ResNet-50-FPN, GroupNorm, long)
40.8
box AP· 2018-03-22
Group Normalization Code
#174RPDet (ResNet-50, multi-scale train)
40.8
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#175DETR-ResNet50 with iRPE-K (150 epochs)
40.8
box AP· 2021-07-29
Rethinking and Improving Relative Position Encoding for Vision Transformer Code
#176Faster R-CNN+aLRP Loss (ResNet-50, 500 scale)
40.7
box AP· 2020-09-28
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection Code
#177PPDet (ResNet-101-FPN)
40.5
box AP· 2020-08-03
Reducing Label Noise in Anchor-Free Object Detection Code
#178GCNet (ResNet-50-FPN, GRoIE)
40.3
box AP· 2019-04-25
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond Code
#179Mask R-CNN (ResNet-50-FPN, GroupNorm)
40.3
box AP· 2018-03-22
Group Normalization Code
#180Cascade R-CNN (ResNet-50-FPN+)
40.3
box AP· 2017-12-03
Cascade R-CNN: Delving into High Quality Object Detection Code
#181ExtremeNet (Hourglass-104, single-scale)
40.3
box AP· 2019-01-23
Bottom-up Object Detection by Grouping Extreme and Center Points Code
#182RPDet (ResNet-101)
40.3
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#183RetinaNet+aLRP Loss (ResNet-50, 500 scale)
40.2
box AP· 2020-09-28
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection Code
#184Mask R-CNN (ResNet-101-FPN)
40
box AP· 2017-03-20
Mask R-CNN Code
#185FPN+
39.8
box AP· 2016-12-09
Feature Pyramid Networks for Object Detection Code
#186FoveaBox+aLRP Loss (ResNet-50, 500 scale)
39.7
box AP· 2020-09-28
A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection Code
#187Grid R-CNN (ResNet-50-FPN)
39.6
box AP· 2018-11-29
Grid R-CNN Code
#188Mask R-CNN (ResNet-50, ACNet)
39.5
box AP· 2019-04-07
Adaptively Connected Neural Networks Code
#189FSAF (ResNet-101, anchor-based branches)
39.3
box AP· 2019-03-02
Feature Selective Anchor-Free Module for Single-Shot Object Detection Code
#190Mask R-CNN (HRNetV2p-W18)
39.2
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#191Mask R-CNN (ResNet-50 + 1 NL)
39
box AP· 2017-11-21
Non-local Neural Networks Code
#192FoveaBox (ResNet-101-FPN, 800x800)
38.9
box AP· 2019-04-08
FoveaBox: Beyond Anchor-based Object Detector Code
#193FCOS (ResNet-50-FPN + improvements)
38.6
box AP· 2019-04-02
FCOS: Fully Convolutional One-Stage Object Detection Code
#194RPDet (ResNet-50)
38.6
box AP· 2019-04-25
RepPoints: Point Set Representation for Object Detection Code
#195Libra R-CNN (ResNet-50 FPN)
38.5
box AP· 2019-04-04
Libra R-CNN: Towards Balanced Learning for Object Detection Code
#196Mask R-CNN (ResNet-50-FPN, GRoIE)
38.4
box AP· 2020-04-28
A novel Region of Interest Extraction Layer for Instance Segmentation Code
#197CornerNet511 (Hourglass-104)
38.4
box AP· 2018-08-03
CornerNet: Detecting Objects as Paired Keypoints Code
#198FoveaBox+Retina (ResNet-50)
38.1
box AP· 2019-04-08
FoveaBox: Beyond Anchor-based Object Detector Code
#199Faster R-CNN (HRNetV2p-W18)
38
box AP· 2019-08-20
Deep High-Resolution Representation Learning for Visual Recognition Code
#200FoveaBox (ResNet-101-FPN, 600x600)
38
box AP· 2019-04-08
FoveaBox: Beyond Anchor-based Object Detector Code
#201FSAF (ResNet-101)
37.9
box AP· 2019-03-02
Feature Selective Anchor-Free Module for Single-Shot Object Detection Code
#202Mask R-CNN (ResNet-50-FPN)
37.7
box AP· 2017-03-20
Mask R-CNN Code
#203Faster R-CNN (ResNet-50-FPN, GRoIE)
37.5
box AP· 2020-04-28
A novel Region of Interest Extraction Layer for Instance Segmentation Code
#204Mask R-CNN (ResNeXt-101-FPN)
36.7
box AP· 2017-03-20
Mask R-CNN Code
#205FoveaBox (ResNet-50-FPN, 600x600)
36
box AP· 2019-04-08
FoveaBox: Beyond Anchor-based Object Detector Code
#206FSAF (ResNet-50)
35.9
box AP· 2019-03-02
Feature Selective Anchor-Free Module for Single-Shot Object Detection Code
#207GHM-C + GHM-R (RetinaNet-FPN-ResNet-50, M=30)
35.8
box AP· 2018-11-13
Gradient Harmonized Single-stage Detector Code
#208Online Fg Bal. Sampling+Hard Negative Mining (ResNet-50)
35.6
box AP· 2019-09-21
Generating Positive Bounding Boxes for Balanced Training of Object Detectors Code
#209M2Det (ResNet-101, 320x320)
34.1
box AP· 2018-11-12
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network Code
#210Faster R-CNN (Res2Net-50)
33.7
box AP· 2019-04-02
Res2Net: A New Multi-scale Backbone Architecture Code
#211M2Det (VGG-16, 320x320)
33.2
box AP· 2018-11-12
M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network Code