
GradDrop

Gradient Sign Dropout

General · Introduced 2020 · 3 papers

Source Paper: Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout

Description

GradDrop, or Gradient Sign Dropout, is a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. It is applied as a layer in any standard network forward pass, usually on the final layer before the prediction head, to save on compute overhead and maximize benefits during backpropagation. Below, we develop the GradDrop formalism. Throughout, $\circ$ denotes elementwise multiplication after any necessary tiling operations are completed. To implement GradDrop, we first define the Gradient Positive Sign Purity, $\mathcal{P}$, as

$$\mathcal{P}=\frac{1}{2}\left(1+\frac{\sum_{i} \nabla L_{i}}{\sum_{i}\left|\nabla L_{i}\right|}\right)$$

$\mathcal{P}$ is bounded by $[0,1]$. For multiple gradient values $\nabla_{a} L_{i}$ at some scalar $a$, we see that $\mathcal{P}=0$ if $\nabla_{a} L_{i}<0 \ \forall i$, while $\mathcal{P}=1$ if $\nabla_{a} L_{i}>0 \ \forall i$. Thus, $\mathcal{P}$ is a measure of how many positive gradients are present at any given value. We then form a mask for each gradient, $\mathcal{M}_{i}$, as follows:

$$\mathcal{M}_{i}=\mathcal{I}[f(\mathcal{P})>U] \circ \mathcal{I}\left[\nabla L_{i}>0\right]+\mathcal{I}[f(\mathcal{P})<U] \circ \mathcal{I}\left[\nabla L_{i}<0\right]$$

for $\mathcal{I}$ the standard indicator function and $f$ some monotonically increasing function (often just the identity) that maps $[0,1] \mapsto [0,1]$ and is odd around $(0.5, 0.5)$. $U$ is a tensor composed of i.i.d. $U(0,1)$ random variables. The $\mathcal{M}_{i}$ are then used to produce the final gradient $\sum_{i} \mathcal{M}_{i} \nabla L_{i}$.
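To make the masking rule concrete, here is a minimal NumPy sketch of the arithmetic above, applied to per-task gradients of a shared activation. The function name `graddrop` and its arguments are illustrative, not from the paper, and in a real network this logic would run inside a custom layer during backpropagation rather than as a standalone function.

```python
import numpy as np

def graddrop(grads, f=lambda p: p, rng=None):
    """Combine per-task gradients with Gradient Sign Dropout (sketch).

    grads: array of shape (num_tasks, ...) holding each task's gradient
           with respect to the same activation tensor.
    f:     monotonically increasing map [0,1] -> [0,1], odd around
           (0.5, 0.5); the identity by default, as in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Gradient Positive Sign Purity: P = (1 + sum_i g_i / sum_i |g_i|) / 2.
    # The epsilon guard (an implementation assumption) avoids division by
    # zero where every task gradient vanishes.
    eps = 1e-12
    purity = 0.5 * (1.0 + grads.sum(axis=0) / (np.abs(grads).sum(axis=0) + eps))

    # One uniform sample per activation value, shared across all tasks.
    u = rng.uniform(size=purity.shape)

    # Keep positive gradients where f(P) > U and negative ones where
    # f(P) < U, i.e. M_i = I[f(P) > U] * I[g_i > 0] + I[f(P) < U] * I[g_i < 0].
    keep_pos = (f(purity) > u) & (grads > 0)
    keep_neg = (f(purity) < u) & (grads < 0)
    mask = (keep_pos | keep_neg).astype(grads.dtype)

    # Final gradient is the masked sum over tasks: sum_i M_i * g_i.
    return (mask * grads).sum(axis=0)

# Example: three task gradients over a 4-dimensional activation.
g = np.array([[ 0.5, -0.2,  1.0, -0.3],
              [ 0.4,  0.1, -0.8, -0.2],
              [-0.1, -0.3,  0.9, -0.4]])
print(graddrop(g, rng=np.random.default_rng(0)))
```

Note that all tasks share a single draw of $U$ per activation value, so on each backward pass every surviving gradient at a given position has the same sign: positions dominated by one sign tend to keep that sign's gradients, while conflicting positions are resolved stochastically.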

Papers Using This Method

- Gradient Sparsification For Masked Fine-Tuning of Transformers (2023-07-19)
- Gradient Sparsification For Masked Fine-Tuning of Transformers (2021-11-16)
- Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout (2020-10-14)