DeLighT Block

Description

A DeLighT Block is the basic building block of the DeLighT transformer architecture. It applies a DExTra transformation to reduce the dimensionality of the vectors fed into the attention layer, where a single-head attention module is used. Because DExTra learns wider representations of the input across its layers, the authors can replace multi-head attention with single-head attention. The attention output is then passed through a light-weight feed-forward network (FFN) which, rather than expanding the dimension (standard Transformers widen the FFN to 4x the model dimension), imposes a bottleneck and squeezes the dimension. Again, this is possible because the DExTra transformation has already incorporated wider representations, so the FFN can squeeze instead.
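To make the dimension flow concrete, here is a minimal NumPy sketch of the block's shape changes. All sizes (sequence length 10, model dimension 64, a 2x DExTra expansion, a 2x reduction into attention, and a 4x FFN bottleneck) are illustrative assumptions, and the DExTra stand-in is a single expand/reduce pair rather than the real stack of group linear transforms; it only mimics DExTra's input and output dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 64  # sequence length and model dimension (illustrative values)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# DExTra stand-in: widen the representation, then project it back down.
# (The real DExTra uses stacked group linear transforms; this single
# expand/reduce pair only reproduces its overall dimension change.)
w_expand = rng.standard_normal((d, 2 * d)) / np.sqrt(d)
w_reduce = rng.standard_normal((2 * d, d // 2)) / np.sqrt(2 * d)

# Single-head attention over the reduced dimension d // 2.
d_attn = d // 2
wq = rng.standard_normal((d_attn, d_attn)) / np.sqrt(d_attn)
wk = rng.standard_normal((d_attn, d_attn)) / np.sqrt(d_attn)
wv = rng.standard_normal((d_attn, d_attn)) / np.sqrt(d_attn)

# Light-weight FFN: a bottleneck (squeeze to d_attn // 4) in place of
# the standard Transformer's 4x expansion.
w_in = rng.standard_normal((d_attn, d_attn // 4)) / np.sqrt(d_attn)
w_out = rng.standard_normal((d_attn // 4, d_attn)) / np.sqrt(d_attn // 4)

x = rng.standard_normal((n, d))
h = np.maximum(x @ w_expand, 0) @ w_reduce       # DExTra-style widen, then reduce
q, k, v = h @ wq, h @ wk, h @ wv
attn = softmax(q @ k.T / np.sqrt(d_attn)) @ v    # single-head attention
out = np.maximum(attn @ w_in, 0) @ w_out         # bottleneck FFN
print(out.shape)  # (10, 32)
```

Note that attention operates at half the model dimension and the FFN contracts rather than expands, which is where the block saves parameters relative to a standard Transformer layer.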
