A Character-level Ngram-based MT Approach for Lexical Normalization in Social Media
Anonymous
Abstract
This paper presents an ngram-based MT approach that operates at the character level to generate candidate canonical forms for lexical variants in social media text. It uses a joint n-gram model to learn edit sequences over word pairs, thereby overcoming a limitation of the phrase-based approach, which cannot capture dependencies across phrases. We evaluate our approach on two English tweet datasets and observe that the ngram-based approach significantly outperforms the phrase-based approach on the normalization task. Our simple model achieves broad coverage of diverse variants, comparable to that of complex hybrid systems.
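The idea of a joint n-gram model over character-level edit sequences can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the alignment uses Python's `difflib.SequenceMatcher` as a stand-in for whatever alignment the authors use, the model is a bigram with add-one smoothing, and the names `edit_units`, `train_bigram`, and `log_prob` are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

BOS = ("<s>", "<s>")    # hypothetical start-of-word unit
EOS = ("</s>", "</s>")  # hypothetical end-of-word unit


def edit_units(variant, canonical):
    """Split a (variant, canonical) pair into joint character-level units.

    Each unit is a (source_chars, target_chars) tuple; "" marks an empty side.
    difflib is an illustrative stand-in for a learned alignment.
    """
    units = []
    matcher = SequenceMatcher(None, variant, canonical, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # One unit per copied character, e.g. ("u", "u").
            units.extend((c, c) for c in variant[i1:i2])
        else:
            # A replace, insert, or delete span becomes a single unit.
            units.append((variant[i1:i2], canonical[j1:j2]))
    return units


def train_bigram(pairs):
    """Count bigrams over unit sequences, so context crosses unit boundaries."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for variant, canonical in pairs:
        seq = [BOS] + edit_units(variant, canonical) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def log_prob(variant, canonical, model):
    """Add-one-smoothed log-probability of the pair's unit sequence."""
    bigrams, contexts, vocab_size = model
    seq = [BOS] + edit_units(variant, canonical) + [EOS]
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # The extra +1 in the denominator reserves mass for unseen units.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size + 1))
    return total


# Toy training pairs, invented for this example only.
model = train_bigram([("u", "you"), ("ur", "your"), ("plz", "please")])
```

Because each unit pairs source and target characters, the n-gram context spans unit boundaries, which is the kind of cross-unit dependency the abstract says phrase-based models miss. A real system would rank many candidate canonical forms by such a score; the sketch only scores a given pair.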