Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Focal Modulation Networks

Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

Published: 2022-03-22
Tasks: Panoptic Segmentation · Image Classification · Segmentation · Semantic Segmentation · Object Detection
Links: Paper · PDF · Code (official)

Abstract

We propose focal modulation networks (FocalNets for short), in which self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges; (ii) gated aggregation to selectively gather contexts for each query token based on its content; and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and segmentation. Specifically, FocalNets at tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at resolution 224, they attain 86.5% and 87.3% top-1 accuracy when finetuned at resolutions 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a 1x schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with a 3x schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we achieve 58.5 mIoU for ADE20K semantic segmentation and 57.9 PQ for COCO panoptic segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA over much larger attention-based models such as SwinV2-G and BEiT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.
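The three components named in the abstract (hierarchical contextualization via depth-wise convolutions, gated aggregation, and element-wise modulation of the query) can be sketched as a small PyTorch module. This is a minimal illustration, not the official implementation from the linked repository; the layer names, the number of focal levels, and the use of a single linear projection for query, context, and gates are assumptions for the sketch.

```python
# Minimal sketch of focal modulation (illustrative, not the official
# microsoft/FocalNet implementation; sizes and names are assumptions).
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection producing query, initial context, and per-level gates.
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # (i) hierarchical contextualization: stacked depth-wise convolutions
        # progressively enlarge the receptive field from short to long range.
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size,
                          padding=kernel_size // 2, groups=dim),
                nn.GELU(),
            )
            for _ in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, 1)   # modulator projection
        self.proj = nn.Linear(dim, dim)   # output projection

    def forward(self, x):                 # x: (B, H, W, C)
        dim = x.shape[-1]
        q, ctx, gates = torch.split(
            self.f(x), [dim, dim, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)     # to (B, C, H, W) for convs
        gates = gates.permute(0, 3, 1, 2)
        # (ii) gated aggregation: each level's context is weighted by a
        # content-dependent gate before being summed.
        ctx_all = 0
        for level, layer in enumerate(self.layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        # A global-average context serves as the final, longest-range level.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # (iii) element-wise modulation: inject aggregated context into the query.
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)  # back to (B, H, W, C)
        return self.proj(q * modulator)


x = torch.randn(2, 8, 8, 16)              # (batch, height, width, channels)
y = FocalModulation(dim=16)(x)
print(y.shape)                            # torch.Size([2, 8, 8, 16])
```

Because the module maps (B, H, W, C) to (B, H, W, C), it is shape-compatible with a window-attention block, which is what lets the paper swap self-attention out wholesale while keeping the rest of a Swin-style architecture.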

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | ADE20K | Validation mIoU | 58.5 | FocalNet-L (Mask2Former)
Panoptic Segmentation | COCO minival | AP | 48.4 | FocalNet-L (Mask2Former, 200 queries)
Panoptic Segmentation | COCO minival | PQ | 57.9 | FocalNet-L (Mask2Former, 200 queries)
Object Detection | COCO test-dev | box mAP | 64.4 | FocalNet-H (DINO)
Object Detection | COCO minival | box AP | 64.2 | FocalNet-H (DINO)
Object Detection | COCO minival | AP50 | 70.3 | FocalNet-T (LRF, Cascade Mask R-CNN)
Object Detection | COCO minival | AP75 | 56.0 | FocalNet-T (LRF, Cascade Mask R-CNN)
Object Detection | COCO minival | box AP | 51.5 | FocalNet-T (LRF, Cascade Mask R-CNN)
Object Detection | COCO minival | AP50 | 70.1 | FocalNet-T (SRF, Cascade Mask R-CNN)
Object Detection | COCO minival | AP75 | 55.8 | FocalNet-T (SRF, Cascade Mask R-CNN)

Related Papers

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
Deep Learning-Based Fetal Lung Segmentation from Diffusion-weighted MRI Images and Lung Maturity Evaluation for Fetal Growth Restriction (2025-07-17)
DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)