Patch Merger

Patch Merger Module

Computer Vision · Introduced 2000 · 1 paper

Description

PatchMerger is a module for Vision Transformers that reduces the number of tokens/patches passed on to the transformer encoder blocks that follow it, reducing compute whilst maintaining performance. PatchMerger linearly transforms an input of shape N patches × D dimensions through a learnable weight matrix of shape M output patches × D (written W^T in the formula, so W itself has shape D × M). This generates an M × N matrix of scores, one for each pair of output patch and input patch. A softmax is applied to each row, that is, over the N input patches, so every row becomes a set of non-negative weights that sum to 1. The resulting M × N matrix is multiplied by the original input to get an output of shape M × D, in which each output patch is a weighted average of the input patches.

Mathematically, $Y = \text{softmax}(W^T X^T) X$
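
The formula can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `patch_merger`, the sizes N = 16, D = 8, M = 4, and the random values standing in for learned weights are all assumptions chosen for illustration, and any normalization or scaling that a full implementation might add is left out. W is taken to have shape D × M so that W^T X^T has shape M × N.

```python
import numpy as np


def patch_merger(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Merge N input patches into M output patches (illustrative sketch).

    x: input patches, shape (N, D)
    w: weight matrix, shape (D, M); learned in practice, random here
    returns: merged patches, shape (M, D)
    """
    scores = w.T @ x.T                                    # (M, N)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    # Softmax over the N input patches: each row sums to 1.
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x                                    # (M, D)


# Hypothetical sizes, chosen only for this example.
N, D, M = 16, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(N, D))
w = rng.normal(size=(D, M))

y = patch_merger(x, w)
print(y.shape)  # (4, 8)
```

Because each row of the softmax output sums to 1, every output patch is a convex combination of the input patches. The output length M is set by the weight matrix alone, so the same module yields M patches regardless of how many patches N it receives.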

Image and formula from: Renggli, C., Pinto, A. S., Houlsby, N., Mustafa, B., Puigcerver, J., & Riquelme, C. (2022). Learning to Merge Tokens in Vision Transformers. arXiv preprint arXiv:2202.12015.

Papers Using This Method