Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SCA-CNN

Spatial and Channel-wise Attention-based Convolutional Neural Network

General · Introduced 2016 · 2 papers
Source Paper

Description

As CNN features are naturally spatial, channel-wise, and multi-layer, Chen et al. proposed a novel spatial and channel-wise attention-based convolutional neural network (SCA-CNN). It was designed for the task of image captioning and uses an encoder-decoder framework in which a CNN first encodes an input image into a feature representation and an LSTM then decodes it into a sequence of words. Given an input feature map $X$ and the previous time step's LSTM hidden state $h_{t-1} \in \mathbb{R}^d$, a spatial attention mechanism, guided by $h_{t-1}$, pays more attention to the semantically useful regions. The spatial attention model is:

\begin{align} a(h_{t-1}, X) &= \tanh(Conv_1^{1 \times 1}(X) \oplus W_1 h_{t-1}) \end{align}

\begin{align} \Phi_s(h_{t-1}, X) &= \text{Softmax}(Conv_2^{1 \times 1}(a(h_{t-1}, X)))
\end{align}
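The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the shapes, parameter names, and the choice to flatten the $H \times W$ grid into $m$ locations are assumptions; a $1 \times 1$ convolution over a $C$-channel map is simply a per-location linear map, which is how it is written here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def spatial_attention(X, h_prev, W_conv1, W1, w_conv2):
    """Spatial attention Phi_s (illustrative sketch).

    X:       (C, m) feature map, flattened to m = H*W spatial locations
    h_prev:  (d,)   previous LSTM hidden state h_{t-1}
    W_conv1: (k, C) 1x1 conv: C channels -> k channels at each location
    W1:      (k, d) projects the hidden state into the same k-dim space
    w_conv2: (k,)   1x1 conv: k channels -> 1 attention logit per location
    returns: (m,)   spatial attention weights summing to 1
    """
    # a = tanh(Conv1(X) ⊕ W1 h): the hidden-state term (a vector) is
    # broadcast across all m spatial locations, matching the ⊕ operator.
    a = np.tanh(W_conv1 @ X + (W1 @ h_prev)[:, None])   # (k, m)
    return softmax(w_conv2 @ a)                          # (m,)
```

Because the softmax runs over spatial locations, the output is a distribution telling the decoder where to look at this time step.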

where $\oplus$ represents addition of a matrix and a vector, the vector being broadcast across spatial locations. Similarly, channel-wise attention first aggregates global spatial information via global average pooling (GAP), then computes a channel-wise attention weight vector from the hidden state $h_{t-1}$:

\begin{align} b(h_{t-1}, X) &= \tanh((W_2 \text{GAP}(X) + b_2) \oplus W_1 h_{t-1}) \end{align}

\begin{align} \Phi_c(h_{t-1}, X) &= \text{Softmax}(W_3 b(h_{t-1}, X) + b_3) \end{align}

Overall, the SCA mechanism can be composed in one of two orders. If channel-wise attention is applied before spatial attention:

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X\,\Phi_c(h_{t-1}, X)), \Phi_c(h_{t-1}, X)) \end{align}

and if spatial attention comes first:

\begin{align} Y &= f(X, \Phi_s(h_{t-1}, X), \Phi_c(h_{t-1}, X\,\Phi_s(h_{t-1}, X))) \end{align}

where $f(\cdot)$ denotes the modulation function, which takes the feature map $X$ and the attention maps as input and outputs the modulated feature map $Y$.
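The channel-wise branch and the channel-first composition can be sketched as below. This is a hedged illustration: the shapes and parameter names are assumptions, and the modulation $f$ is taken to be element-wise scaling by the attention weights, a common choice that the excerpt does not pin down.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_attention(X, h_prev, W2, b2, W1, W3, b3):
    """Channel-wise attention Phi_c (illustrative sketch).

    X:  (C, m) feature map; GAP averages over the m spatial locations.
    W2: (k, C), b2: (k,), W1: (k, d), W3: (C, k), b3: (C,)
    returns: (C,) channel attention weights summing to 1
    """
    gap = X.mean(axis=1)                          # (C,) global average pooling
    b = np.tanh(W2 @ gap + b2 + W1 @ h_prev)      # (k,)
    return softmax(W3 @ b + b3)                   # (C,)

def sca_channel_first(X, channel_fn, spatial_fn):
    """Channel-spatial order: weight channels first, then compute spatial
    attention on the modulated map, then apply both weightings.
    channel_fn(X) -> (C,) weights; spatial_fn(X) -> (m,) weights.
    f is modeled here as element-wise scaling (an assumption)."""
    beta = channel_fn(X)                # Phi_c(h_{t-1}, X)
    Xc = beta[:, None] * X              # X modulated channel-wise
    alpha = spatial_fn(Xc)              # Phi_s(h_{t-1}, X Phi_c)
    return Xc * alpha[None, :]          # Y = f(X, alpha, beta)
```

Swapping the two calls (spatial first, then channel attention on the spatially modulated map) yields the other composition order from the text.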

Unlike previous attention mechanisms, which consider each image region equally and use global spatial information to tell the network where to focus, SCA-CNN leverages the semantic vector to produce both the spatial attention map and the channel-wise attention weight vector. Beyond being a powerful attention model, SCA-CNN also provides a better understanding of where and on what the model should focus during sentence generation.

Papers Using This Method

- Aesthetic Attributes Assessment of Images (2019-07-11)
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning (2016-11-17)