Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Channel-wise Cross Attention

Computer Vision · Introduced 2021 · 5 papers
Source Paper

Description

Channel-wise Cross Attention is a module for semantic segmentation used in the UCTransNet architecture. It fuses features of inconsistent semantics between the Channel Transformer and the U-Net decoder: it guides channel-wise filtration of the Transformer features and eliminates their ambiguity with respect to the decoder features.

Mathematically, we take the $i$-th level Transformer output $\mathbf{O}_i \in \mathbb{R}^{C \times H \times W}$ and the $i$-th level decoder feature map $\mathbf{D}_i \in \mathbb{R}^{C \times H \times W}$ as the inputs of Channel-wise Cross Attention. Spatial squeeze is performed by a global average pooling (GAP) layer, producing a vector $\mathcal{G}(\mathbf{X}) \in \mathbb{R}^{C \times 1 \times 1}$ whose $k$-th channel is $\mathcal{G}(\mathbf{X})^{k} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{X}^{k}(i, j)$. We use this operation to embed the global spatial information and then generate the attention mask:

$$\mathbf{M}_i = \mathbf{L}_1 \cdot \mathcal{G}(\mathbf{O}_i) + \mathbf{L}_2 \cdot \mathcal{G}(\mathbf{D}_i)$$

where $\mathbf{L}_1 \in \mathbb{R}^{C \times C}$ and $\mathbf{L}_2 \in \mathbb{R}^{C \times C}$ are the weights of two Linear layers, and $\delta(\cdot)$ denotes the ReLU operator. The equation above encodes the channel-wise dependencies. Following ECA-Net, which empirically showed that avoiding dimensionality reduction is important for learning channel attention, the authors use a single Linear layer and a sigmoid function to build the channel attention map. The resultant vector is used to recalibrate, or excite, $\mathbf{O}_i$ to $\bar{\mathbf{O}}_i = \sigma(\mathbf{M}_i) \cdot \mathbf{O}_i$, where the activation $\sigma(\mathbf{M}_i)$ indicates the importance of each channel. Finally, the masked $\bar{\mathbf{O}}_i$ is concatenated with the up-sampled features of the $i$-th level decoder.
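The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the UCTransNet implementation: the weights `L1` and `L2` stand in for learned Linear-layer parameters, and biases are omitted for brevity.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_wise_cross_attention(O, D, L1, L2):
    """Recalibrate Transformer features O using decoder features D.

    O, D   : (C, H, W) i-th level Transformer output / decoder feature map
    L1, L2 : (C, C)    stand-ins for the two learned Linear-layer weights
    Returns the masked features O_bar of shape (C, H, W).
    """
    # Spatial squeeze: global average pooling over H and W -> shape (C,)
    g_O = O.mean(axis=(1, 2))
    g_D = D.mean(axis=(1, 2))
    # Attention mask: M_i = L1 . G(O_i) + L2 . G(D_i)
    M = L1 @ g_O + L2 @ g_D
    # Sigmoid gives per-channel importance in (0, 1);
    # broadcast it over the spatial dimensions to excite O.
    return sigmoid(M)[:, None, None] * O


# Toy usage with random features and random (untrained) weights
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
O = rng.normal(size=(C, H, W))
D = rng.normal(size=(C, H, W))
L1 = rng.normal(size=(C, C))
L2 = rng.normal(size=(C, C))
O_bar = channel_wise_cross_attention(O, D, L1, L2)
print(O_bar.shape)  # (4, 8, 8)
```

Note that each output channel is the corresponding input channel scaled by a single scalar in $(0, 1)$, so the module reweights channels without mixing spatial locations.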

Papers Using This Method

Enhancing Conditional Image Generation with Explainable Latent Space Manipulation (2024-08-29)
Boosting Medical Image Segmentation Performance with Adaptive Convolution Layer (2024-04-17)
ACC-UNet: A Completely Convolutional UNet model for the 2020s (2023-08-25)
LViT: Language meets Vision Transformer in Medical Image Segmentation (2022-06-29)
UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer (2021-09-09)