A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization

Shohei Higashiyama, Masao Utiyama, Taro Watanabe, Eiichiro Sumita

2021-11-01 | WNUT (ACL) 2021 | Tasks: Part-Of-Speech Tagging, Lexical Normalization

Abstract

Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model that solves the three tasks jointly, along with methods for generating pseudo-labeled data to overcome the problem of data deficiency. Our experiments showed that the proposed model achieved better normalization performance when trained on more diverse pseudo-labeled data.
