Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Spatial Attention-Guided Mask

General · Introduced 2019 · 7 papers
Source Paper: CenterMask: Real-Time Anchor-Free Instance Segmentation (2019)

Description

A Spatial Attention-Guided Mask is a module for instance segmentation that predicts a segmentation mask on each detected box using a spatial attention map, guiding the mask head to spotlight informative pixels and suppress uninformative ones.

Once features inside the predicted RoIs are extracted by RoIAlign at 14×14 resolution, they are fed sequentially into four conv layers and the spatial attention module (SAM). To exploit the spatial attention map $A_{sag}(X_i) \in \mathcal{R}^{1\times W\times H}$ as a feature descriptor given an input feature map $X_i \in \mathcal{R}^{C\times W\times H}$, the SAM first generates pooled features $P_{avg}, P_{max} \in \mathcal{R}^{1\times W\times H}$ by average and max pooling along the channel axis, respectively, and aggregates them via concatenation. This is followed by a 3×3 conv layer and normalized by the sigmoid function. The computation process is summarized as follows:

$$A_{sag}(X_i) = \sigma\left(F_{3\times 3}(P_{max} \cdot P_{avg})\right)$$

where $\sigma$ denotes the sigmoid function, $F_{3\times 3}$ is a 3×3 conv layer, and $\cdot$ represents the concatenation operation. Finally, the attention-guided feature map $X_{sag} \in \mathcal{R}^{C\times W\times H}$ is computed as:

$$X_{sag} = A_{sag}(X_i) \otimes X_i$$

where $\otimes$ denotes element-wise multiplication. A 2×2 deconv then upsamples the spatially attended feature map to 28×28 resolution, and a final 1×1 conv predicts class-specific masks.
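The pipeline above can be sketched in PyTorch. This is a minimal, hypothetical implementation following the description here, not official CenterMask code; the channel count, class count, and ReLU placements are assumptions:

```python
import torch
import torch.nn as nn


class SpatialAttentionModule(nn.Module):
    """Computes A_sag: channel-wise avg/max pooling, concat, 3x3 conv, sigmoid."""

    def __init__(self):
        super().__init__()
        # Two pooled maps in, one attention map out (F_3x3 in the formula)
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x):                             # x: (N, C, H, W)
        p_avg = x.mean(dim=1, keepdim=True)           # (N, 1, H, W)
        p_max = x.max(dim=1, keepdim=True).values     # (N, 1, H, W)
        pooled = torch.cat([p_max, p_avg], dim=1)     # the "·" (concat) operation
        a_sag = torch.sigmoid(self.conv(pooled))      # (N, 1, H, W)
        return a_sag * x                              # element-wise ⊗ (broadcast over C)


class SAGMaskHead(nn.Module):
    """Mask head sketch: four 3x3 convs -> SAM -> 2x2 deconv -> 1x1 conv."""

    def __init__(self, in_channels=256, num_classes=80):  # assumed sizes
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for _ in range(4)
        ])
        self.sam = SpatialAttentionModule()
        # 2x2 stride-2 deconv doubles spatial resolution: 14x14 -> 28x28
        self.deconv = nn.ConvTranspose2d(in_channels, in_channels, 2, stride=2)
        # 1x1 conv predicts one mask per class
        self.predictor = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, roi_feats):                     # (N, C, 14, 14) from RoIAlign
        x = self.sam(self.convs(roi_feats))
        return self.predictor(torch.relu(self.deconv(x)))  # (N, num_classes, 28, 28)
```

Because `a_sag` has a single channel, the multiplication broadcasts the same spatial attention weight across all feature channels, which is what lets the map act as a per-pixel gate on the RoI features.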

Papers Using This Method

- Polite Teacher: Semi-Supervised Instance Segmentation with Mutual Learning and Pseudo-Label Thresholding (2022-11-07)
- Intelligent detect for substation insulator defects based on CenterMask (2022-08-31)
- CenterMask: single shot instance segmentation with point representation (2020-04-09)
- SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation (2020-04-07)
- Learning Oracle Attention for High-fidelity Face Completion (2020-03-31)
- Context-Aware Domain Adaptation in Semantic Segmentation (2020-03-09)
- CenterMask: Real-Time Anchor-Free Instance Segmentation (2019-11-15)