A Character-level Ngram-based MT Approach for Lexical Normalization in Social Media

Anonymous

2021-12-17 | ACL ARR December 2022 12 | Lexical Normalization

Abstract

This paper presents an ngram-based MT approach that operates at the character level to generate possible canonical forms for lexical variants in social media text. It uses a joint n-gram model to learn edit sequences from word pairs, thereby overcoming a limitation of the phrase-based approach, which cannot capture dependencies across phrases. We evaluate our approach on two English tweet datasets and observe that the ngram-based approach significantly outperforms the phrase-based approach on the normalization task. Our simple model achieves broad coverage of diverse variants, comparable to that of complex hybrid systems.
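As a rough illustration of the idea of learning edit sequences with a joint n-gram model, the sketch below aligns each (variant, canonical) word pair at the character level and trains a bigram model over the resulting joint units. It is a minimal sketch under stated assumptions, not the paper's system: the alignment via Python's difflib, the bigram order, the add-one smoothing, and all names (edit_sequence, JointBigramModel, the toy training pairs) are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical boundary markers for the start and end of an edit sequence.
BOS = ("<s>", "<s>")
EOS = ("</s>", "</s>")


def edit_sequence(variant, canonical):
    """Align two words by character and return a list of joint units.

    Each unit is a (source_chars, target_chars) pair. Matching characters
    become one-to-one units; an empty string marks an insertion or deletion.
    """
    units = []
    matcher = SequenceMatcher(None, variant, canonical, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            units.extend((c, c) for c in variant[i1:i2])
        else:
            units.append((variant[i1:i2], canonical[j1:j2]))
    return units


class JointBigramModel:
    """Bigram model over joint (source, target) units with add-one smoothing."""

    def __init__(self):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    def train(self, pairs):
        for variant, canonical in pairs:
            seq = [BOS] + edit_sequence(variant, canonical) + [EOS]
            self.vocab.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, variant, canonical):
        """Smoothed log-probability of the pair's edit sequence."""
        seq = [BOS] + edit_sequence(variant, canonical) + [EOS]
        v = len(self.vocab) + 1  # reserve one slot for unseen units
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + v))
            for p, c in zip(seq, seq[1:])
        )


# Toy training pairs, invented for this example only.
model = JointBigramModel()
model.train([("u", "you"), ("ur", "your"), ("2morrow", "tomorrow")])
print(edit_sequence("u", "you"))
print(model.log_prob("u", "you"), model.log_prob("u", "xyz"))
```

Because each unit pairs source and target characters, the probability of an edit is conditioned on the preceding edit, which is the kind of cross-unit dependency the abstract says a phrase-based model misses. A candidate generator would additionally need a decoder that searches over possible target strings; that part is omitted here.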

Related Papers

ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization (2025-01-13)
A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media (2024-09-30)
ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text (2024-01-29)
Automatic Textual Normalization for Hate Speech Detection (2023-11-12)
Increasing Robustness for Cross-domain Dialogue Act Classification on Social Media Data (2022-10-01)
A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization (2021-11-01)
To What Extent Does Lexical Normalization Help English-as-a-Second Language Learners to Read Noisy English Texts? (2021-11-01)
Multilingual Sequence Labeling Approach to solve Lexical Normalization (2021-11-01)