Abstract
Character-based representations have important advantages over subword-based ones, including increased robustness to noisy input and removing the need for tokenization preprocessing. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer groups (i.e., downsamples) characters to address this, but allows information to leak when applied to a Transformer decoder. We introduce a novel methodology that solves this information leak, opening up the possibility of using character grouping in the decoder. We show that Charformer downsampling has no apparent benefits in NMT over previous downsampling methods.
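To make the leak concrete, below is a minimal sketch of block-wise character downsampling. It uses plain mean pooling rather than the full GBST (which scores and mixes several candidate block sizes), and the function name block_downsample is illustrative, not from the paper's code. The point it demonstrates: the block containing generation step t also pools over positions after t, so a causal mask applied at the block level cannot prevent the decoder from seeing characters it has not yet produced.

```python
import torch

def block_downsample(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Mean-pool a character sequence into fixed-size blocks.

    Simplified stand-in for GBST-style downsampling; assumes seq_len is a
    multiple of block_size for brevity.

    x: (batch, seq_len, dim) character embeddings
    returns: (batch, seq_len // block_size, dim) block embeddings
    """
    b, n, d = x.shape
    blocks = x.view(b, n // block_size, block_size, d)
    return blocks.mean(dim=2)

# The leak: block 0 averages characters 0..3, so when the decoder is
# predicting character 1, its input block already contains characters
# 2 and 3 -- future positions. The leak happens inside the pooling itself,
# before any attention mask can be applied.
x = torch.randn(1, 8, 16)        # 8 characters, embedding dim 16
pooled = block_downsample(x, 4)  # 2 blocks of 4 characters each
print(pooled.shape)              # torch.Size([1, 2, 16])
```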