Gradient-Based Subword Tokenization
GBST
Description
GBST (Gradient-Based Subword Tokenization) is a soft, gradient-based subword tokenization module that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns a position-wise soft selection over them by scoring each block with a block scoring network.
In contrast to prior tokenization-free methods, GBST learns interpretable latent subwords, which enables easy inspection of lexical representations, and it is more efficient than other byte-level models.
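The block enumeration and soft selection described above can be sketched as follows. This is a minimal, simplified NumPy illustration, not the paper's implementation: it mean-pools non-overlapping blocks for each candidate block size, upsamples each pooled sequence back to character length, scores positions with a single linear scoring vector (standing in for the block scoring network), and takes a position-wise softmax-weighted sum over block sizes. The function name `gbst_sketch` and the random linear scorer are assumptions for illustration.

```python
import numpy as np

def gbst_sketch(char_emb, block_sizes=(1, 2, 3, 4), rng=None):
    """Simplified GBST: soft selection over mean-pooled candidate subword blocks.

    char_emb: (L, d) array of character embeddings.
    Returns a (L, d) array of latent subword representations.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    L, d = char_emb.shape
    # Linear block scoring "network" (a random vector here, for illustration).
    w = rng.normal(size=(d,)) / np.sqrt(d)

    reps, scores = [], []
    for b in block_sizes:
        # Pad so the length divides evenly into blocks of size b.
        pad = (-L) % b
        x = np.pad(char_emb, ((0, pad), (0, 0)))
        # Mean-pool non-overlapping blocks: (ceil(L/b), d).
        pooled = x.reshape(-1, b, d).mean(axis=1)
        # Upsample back to character resolution by repeating each block rep.
        up = np.repeat(pooled, b, axis=0)[:L]
        reps.append(up)
        # Position-wise score for this block size.
        scores.append(up @ w)

    S = np.stack(scores, axis=-1)                  # (L, num_block_sizes)
    # Softmax over candidate block sizes at each position.
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    R = np.stack(reps, axis=1)                     # (L, num_block_sizes, d)
    # Soft selection: weighted sum of candidate block representations.
    return (P[..., None] * R).sum(axis=1)          # (L, d)
```

Because the selection weights are a differentiable softmax rather than a hard choice, gradients flow through the scoring step, which is what makes the tokenization learnable end to end.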
Papers Using This Method
- CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising (2022-07-16)
- Patching Leaks in the Charformer for Efficient Character-Level Generation (2022-05-27)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (2022-02-22)
- Patching Leaks in the Charformer for Generative Tasks (2022-01-16)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization (2021-06-23)