Abstract
Character-based representations have important advantages over subword-based ones, including increased robustness to noisy input and removing the need for tokenization preprocessing. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer groups (i.e., downsamples) characters to address this, but allows information to leak when applied to a Transformer decoder. We introduce a novel methodology that solves this information leak, opening up the possibility of using character grouping in the decoder. We show that Charformer downsampling has no apparent benefits in NMT over previous downsampling methods.
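To make the leak concrete, below is a minimal sketch of block-wise character downsampling. It uses plain mean pooling rather than the full GBST (which scores and mixes several candidate block sizes), and the function name block_downsample is illustrative, not from the paper's code. The point it demonstrates: the block containing generation step t also pools over positions after t, so a causal mask applied at the block level cannot prevent the decoder from seeing characters it has not yet produced.

```python
import torch

def block_downsample(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Mean-pool a character sequence into fixed-size blocks.

    Simplified stand-in for GBST-style downsampling; assumes seq_len is a
    multiple of block_size for brevity.

    x: (batch, seq_len, dim) character embeddings
    returns: (batch, seq_len // block_size, dim) block embeddings
    """
    b, n, d = x.shape
    blocks = x.view(b, n // block_size, block_size, d)
    return blocks.mean(dim=2)

# The leak: block 0 averages characters 0..3, so when the decoder is
# predicting character 1, its input block already contains characters
# 2 and 3 -- future positions. The leak happens inside the pooling itself,
# before any attention mask can be applied.
x = torch.randn(1, 8, 16)        # 8 characters, embedding dim 16
pooled = block_downsample(x, 4)  # 2 blocks of 4 characters each
print(pooled.shape)              # torch.Size([1, 2, 16])
```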