Xiuquan Hou, Meiqin Liu, Senlin Zhang, Ping Wei, Badong Chen
DETR-like methods have significantly increased detection performance in an end-to-end manner. Mainstream two-stage DETR frameworks perform dense self-attention and select a fraction of queries for sparse cross-attention, which has proven effective for improving performance but also introduces a heavy computational burden and a high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues, we propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries, for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries, we introduce elaborate query refinement modules for stable two-stage initialization. Based on the above improvements, the proposed Salience DETR achieves significant gains of +4.0% AP, +0.2% AP, and +4.4% AP on three challenging task-specific detection datasets, as well as 49.2% AP on COCO 2017 with fewer FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR.
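The core idea of encoding only the filtered, high-salience queries can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `filter_and_encode`, the `keep_ratio` parameter, and the use of scalar tokens are all assumptions made for illustration, and the sketch covers a single filtering level only, whereas the paper describes a hierarchical scheme with learned, scale-independent salience.

```python
from typing import Callable, List, Sequence


def filter_and_encode(
    tokens: Sequence[float],
    salience: Sequence[float],
    keep_ratio: float,
    encode: Callable[[List[float]], List[float]],
) -> List[float]:
    """Encode only the most salient tokens; pass the rest through unchanged.

    Hypothetical helper: tokens are scalars here to keep the sketch small,
    and `encode` stands in for a transformer encoder layer.
    """
    if len(tokens) != len(salience):
        raise ValueError("tokens and salience must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    num_keep = max(1, int(len(tokens) * keep_ratio)) if tokens else 0

    # Rank token indices by salience (highest first), keep the top ones,
    # then restore their original order.
    ranked = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)
    kept = sorted(ranked[:num_keep])

    # Only the selected tokens go through the (expensive) encoder.
    encoded = encode([tokens[i] for i in kept])

    # Scatter encoded tokens back; unselected tokens keep their old values.
    out = list(tokens)
    for i, value in zip(kept, encoded):
        out[i] = value
    return out


tokens = [1.0, 2.0, 3.0, 4.0]
salience = [0.9, 0.1, 0.8, 0.2]
result = filter_and_encode(
    tokens, salience, keep_ratio=0.5, encode=lambda xs: [10 * x for x in xs]
)
print(result)  # [10.0, 2.0, 30.0, 4.0]
```

With `keep_ratio=0.5`, only the two tokens with the highest salience (indices 0 and 2) are passed to the encoder, so the encoder cost in this sketch scales with the number of kept tokens rather than with the full token count.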
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Object Detection | COCO 2017 val | AP | 57.3 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | AP50 | 75.5 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | AP75 | 62.3 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | APL | 74.5 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | APM | 61.8 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | APS | 40.9 | Salience-DETR (Focal-L 1x) |
| Object Detection | COCO 2017 val | AP | 56.5 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | AP50 | 75.0 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | AP75 | 61.5 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | APL | 72.8 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | APM | 61.2 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | APS | 40.2 | Salience-DETR (Swin-L 1x) |
| Object Detection | COCO 2017 val | AP | 50.0 | Salience-DETR (ResNet50 1x) |
| Object Detection | COCO 2017 val | AP50 | 67.7 | Salience-DETR (ResNet50 1x) |
| Object Detection | COCO 2017 val | AP75 | 54.2 | Salience-DETR (ResNet50 1x) |
| Object Detection | COCO 2017 val | APL | 64.4 | Salience-DETR (ResNet50 1x) |
| Object Detection | COCO 2017 val | APM | 54.4 | Salience-DETR (ResNet50 1x) |
| Object Detection | COCO 2017 val | APS | 33.3 | Salience-DETR (ResNet50 1x) |