Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Class Attention

General · Introduced 2021 · 36 papers
Source Paper: Going Deeper with Image Transformers (Touvron et al., 2021)

Description

A Class Attention layer, or CA layer, is an attention mechanism for vision transformers used in CaiT that aims to extract information from a set of processed patches. It is identical to a self-attention layer, except that the attention is computed between (i) the class embedding $x_{\text{class}}$ (initialized at CLS in the first CA layer) and (ii) itself plus the set of frozen patch embeddings $x_{\text{patches}}$.

Consider a network with $h$ heads and $p$ patches, and denote by $d$ the embedding size. The multi-head class attention is parameterized by projection matrices $W_q, W_k, W_v, W_o \in \mathbf{R}^{d \times d}$ and corresponding biases $b_q, b_k, b_v, b_o \in \mathbf{R}^{d}$. With this notation, the computation of the CA residual block proceeds as follows. We first augment the patch embeddings (in matrix form) as $z = \left[x_{\text{class}}, x_{\text{patches}}\right]$. We then perform the projections:

$$Q = W_q \, x_{\text{class}} + b_q$$

$$K = W_k \, z + b_k$$

$$V = W_v \, z + b_v$$

The class-attention weights are given by

$$A = \operatorname{Softmax}\left(Q K^{T} / \sqrt{d/h}\right)$$

where $Q K^{T} \in \mathbf{R}^{h \times 1 \times (p+1)}$, since $z$ contains the class token plus the $p$ patch tokens. This attention is involved in the weighted sum $A \times V$ to produce the residual output vector

$$\operatorname{out}_{\mathrm{CA}} = W_o \, A V + b_o$$

which is in turn added to $x_{\text{class}}$ for subsequent processing.
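The computation above can be sketched in NumPy. This is an illustrative implementation, not the CaiT reference code: the parameter-dict layout, function name, and the tiny smoke test at the end are assumptions for the example, and only the class token is updated, exactly as in the equations.

```python
import numpy as np

def class_attention(x_class, x_patches, params, h):
    """One multi-head class-attention residual block (illustrative sketch).

    x_class:   (1, d) class embedding
    x_patches: (p, d) frozen patch embeddings
    params:    dict with Wq, Wk, Wv, Wo of shape (d, d) and
               bq, bk, bv, bo of shape (d,)  [assumed layout]
    h:         number of heads (must divide d)
    """
    d = x_class.shape[-1]
    dh = d // h
    z = np.concatenate([x_class, x_patches], axis=0)   # (p+1, d)

    # Projections: the query comes only from the class token.
    Q = x_class @ params["Wq"] + params["bq"]          # (1, d)
    K = z @ params["Wk"] + params["bk"]                # (p+1, d)
    V = z @ params["Wv"] + params["bv"]                # (p+1, d)

    # Split into heads: (h, tokens, dh)
    Qh = Q.reshape(1, h, dh).transpose(1, 0, 2)
    Kh = K.reshape(-1, h, dh).transpose(1, 0, 2)
    Vh = V.reshape(-1, h, dh).transpose(1, 0, 2)

    # A = Softmax(Q K^T / sqrt(d/h)), shape (h, 1, p+1)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)

    # Weighted sum A x V, merge heads, project, add residual.
    out = (A @ Vh).transpose(1, 0, 2).reshape(1, d)    # (1, d)
    return x_class + out @ params["Wo"] + params["bo"]

# Tiny smoke test with random parameters.
rng = np.random.default_rng(0)
d, p, h = 8, 4, 2
params = {f"W{n}": rng.standard_normal((d, d)) * 0.1 for n in "qkvo"}
params.update({f"b{n}": np.zeros(d) for n in "qkvo"})
x_cls = rng.standard_normal((1, d))
x_pat = rng.standard_normal((p, d))
y = class_attention(x_cls, x_pat, params, h)
print(y.shape)  # (1, 8): only the class token is updated
```

Note that, unlike self-attention, the cost here is linear in the number of patches per query, since there is exactly one query (the class token) attending over $p+1$ keys.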

Papers Using This Method

- Prompt-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis (2025-01-16)
- Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (2025-01-01)
- SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis (2024-12-26)
- An Attention-based Representation Distillation Baseline for Multi-Label Continual Learning (2024-07-19)
- LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition (2024-01-08)
- Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis (2023-10-02)
- Class Attention Transfer Based Knowledge Distillation (2023-04-25)
- SACANet: scene-aware class attention network for semantic segmentation of remote sensing images (2023-04-22)
- Detecting Severity of Diabetic Retinopathy from Fundus Images: A Transformer Network-based Review (2023-01-03)
- Bidirectional Representations for Low Resource Spoken Language Understanding (2022-11-24)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention (2022-09-28)
- Class-attention Video Transformer for Engagement Intensity Prediction (2022-08-12)
- MaiT: Leverage Attention Masks for More Efficient Image Transformers (2022-07-06)
- Augmenting Convolutional networks with attention-based aggregation (2021-12-27)
- SSA: Semantic Structure Aware Inference for Weakly Pixel-Wise Dense Predictions without Cost (2021-11-05)
- MaiT: integrating spatial locality into image transformers with attention masks (2021-09-29)
- Is cell segregation like oil and water: asymptotic versus transitory regime (2021-09-01)
- Dynamic Relevance Learning for Few-Shot Object Detection (2021-08-04)
- Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer (2021-08-03)
- A comparison of latent semantic analysis and correspondence analysis of document-term matrices (2021-07-25)