MoNoise: Modeling Noise Using a Modular Normalization System

Rob van der Goot, Gertjan van Noord

2017-10-10 · Spelling Correction · Word Embeddings · Lexical Normalization

Abstract

We propose MoNoise: a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently.
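The generate-then-rank design described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the MoNoise implementation: the toy vocabulary, the lookup list, the hard-coded embedding neighbours, the function names, and above all the hand-set weights, which stand in for the trained random forest classifier. String similarity from `difflib` likewise stands in for a real spelling correction system, and N-gram features are omitted.

```python
from difflib import SequenceMatcher

# Toy resources; all hypothetical and not taken from the paper.
VOCAB = ["you", "your", "tomorrow", "see", "great", "are"]
LOOKUP = {"u": "you", "2morrow": "tomorrow"}
EMBEDDING_NEIGHBOURS = {"gr8": ["great"], "u": ["you", "your"]}


def lookup_module(word):
    """Static lookup list: propose the known replacement, if any."""
    return {LOOKUP[word]: 1.0} if word in LOOKUP else {}


def spelling_module(word, threshold=0.6):
    """Spelling stand-in: vocabulary words with high string similarity."""
    scores = {}
    for cand in VOCAB:
        ratio = SequenceMatcher(None, word, cand).ratio()
        if ratio >= threshold:
            scores[cand] = ratio
    return scores


def embedding_module(word):
    """Embedding stand-in: precomputed neighbours, scored by reciprocal rank."""
    neighbours = EMBEDDING_NEIGHBOURS.get(word, [])
    return {cand: 1.0 / (rank + 1) for rank, cand in enumerate(neighbours)}


MODULES = [
    ("lookup", lookup_module),
    ("spelling", spelling_module),
    ("embedding", embedding_module),
]

# Hand-set weights: an assumption replacing the trained random forest ranker.
WEIGHTS = {"original": 0.5, "lookup": 2.0, "spelling": 1.0, "embedding": 1.5}


def generate(word):
    """Collect candidates from all modules; each module contributes one feature."""
    features = {word: {"original": 1.0}}  # keeping the word is always an option
    for name, module in MODULES:
        for cand, score in module(word).items():
            features.setdefault(cand, {})[name] = score
    return features


def rank(features):
    """Order candidates by the weighted sum of their module features."""
    def score(cand):
        return sum(WEIGHTS[name] * value for name, value in features[cand].items())
    return sorted(features, key=score, reverse=True)


def normalize(tokens):
    """Replace each token with its top-ranked candidate."""
    return [rank(generate(token))[0] for token in tokens]
```

Calling `normalize(["see", "u", "2morrow"])` runs every token through all three modules and keeps the highest-scoring candidate. Because the original token is always among the candidates, a word that is already canonical can be left unchanged by the same ranking step.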

Results

Task	Dataset	Metric	Value	Model
Lexical Normalization	LexNorm	Accuracy	87.63	MoNoise
