
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

MoNoise: Modeling Noise Using a Modular Normalization System

Rob van der Goot, Gertjan van Noord

2017-10-10 | Spelling Correction | Word Embeddings | Lexical Normalization

Abstract

We propose MoNoise, a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain; in our case, from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise outperforms the state of the art on different normalization benchmarks for English and Dutch, each of which defines the normalization task slightly differently.
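
The generate-then-rank design described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the word lists, module functions, and weights below are all hypothetical toy values, a fixed dictionary stands in for embedding neighbours, and a hand-weighted scoring function stands in for the trained random forest ranker. It only shows how independent generation modules can feed one shared ranking step.

```python
from difflib import get_close_matches

# Toy resources (all hypothetical, for illustration only).
VOCAB = ["tomorrow", "you", "see", "great", "to", "too", "the"]
LOOKUP = {"u": "you", "2morrow": "tomorrow"}          # static lookup list
NEIGHBOURS = {"gr8": ["great"], "2": ["to", "too"]}   # stand-in for embedding neighbours


# Each generation module maps a word to a set of candidate replacements.
def gen_original(word):
    return {word}  # keeping the word unchanged is always a candidate


def gen_lookup(word):
    return {LOOKUP[word]} if word in LOOKUP else set()


def gen_spelling(word):
    # Crude spelling correction: dictionary words with similar spelling.
    return set(get_close_matches(word, VOCAB, n=3, cutoff=0.7))


def gen_embeddings(word):
    return set(NEIGHBOURS.get(word, []))


MODULES = {
    "original": gen_original,
    "lookup": gen_lookup,
    "spelling": gen_spelling,
    "embeddings": gen_embeddings,
}


def generate(word):
    """Return {candidate: names of the modules that proposed it}."""
    candidates = {}
    for name, module in MODULES.items():
        for cand in module(word):
            candidates.setdefault(cand, set()).add(name)
    return candidates


def score(cand, sources):
    # Hand-set weights standing in for a trained ranker; the features are
    # "which modules proposed this candidate" plus a dictionary check.
    weights = {"original": 0.2, "lookup": 1.0, "spelling": 0.5, "embeddings": 0.6}
    return sum(weights[m] for m in sources) + 0.3 * (cand in VOCAB)


def normalize(sentence):
    out = []
    for word in sentence.split():
        cands = generate(word)
        # sorted() makes tie-breaking deterministic.
        out.append(max(sorted(cands), key=lambda c: score(c, cands[c])))
    return " ".join(out)


print(normalize("see u 2morrow"))
```

With these toy resources, the call prints "see you tomorrow". Adding a new type of normalization action only requires registering another function in MODULES; the ranking step stays unchanged, which is the kind of modularity the abstract emphasizes.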

Results

Task | Dataset | Metric | Value | Model
Lexical Normalization | LexNorm | Accuracy | 87.63 | MoNoise

Related Papers

Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models (2025-06-30)
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition (2025-06-23)
Low-resource keyword spotting using contrastively trained transformer acoustic word embeddings (2025-06-21)
Characterizing Linguistic Shifts in Croatian News via Diachronic Word Embeddings (2025-06-16)
Learning Obfuscations Of LLM Embedding Sequences: Stained Glass Transform (2025-06-11)
Recommender systems, stigmergy, and the tyranny of popularity (2025-06-06)
Static Word Embeddings for Sentence Semantic Representation (2025-06-05)