Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

Peijie Jiang, Dingkun Long, Yanzhao Zhang, Pengjun Xie, Meishan Zhang, Min Zhang

2022-10-27

Chinese Word Segmentation, Part-Of-Speech Tagging, Named Entity Recognition, Chinese Named Entity Recognition, Named Entity Recognition (NER), Language Modelling


Abstract

Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to the use of a high-quality external lexicon, where lexicon items can offer explicit boundary information. However, to ensure the quality of the lexicon, great human effort is always necessary, which has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets. In addition, our method can complement previous supervised lexicon exploration, where further improvements can be achieved when integrated with external lexicon information.
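
The abstract does not say which statistics BABERT uses. As a minimal sketch of what unsupervised statistical boundary information could look like, the Python below computes pointwise mutual information (PMI) between adjacent characters from raw text. PMI is a classic corpus statistic in which a high score suggests that two characters tend to belong to the same word. The function name `adjacent_pmi` and the toy corpus are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return PMI for every adjacent character pair seen in the corpus.

    Hypothetical helper for illustration only; it is not BABERT's code.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi                  # probability of the pair
        p_x = unigrams[pair[0]] / n_uni      # probability of the first character
        p_y = unigrams[pair[1]] / n_uni      # probability of the second character
        pmi[pair] = math.log(p_xy / (p_x * p_y))
    return pmi


# Toy, unlabeled corpus (assumed for the example).
corpus = ["北京大学", "北京天气", "大学生活", "天气很好"]
scores = adjacent_pmi(corpus)
print(scores["北京"] > scores["京大"])
```

On this toy corpus, 北京 scores higher than 京大, which hints at a boundary between 京 and 大. Estimates from so little text are noisy, since rare pairs get inflated scores, so a realistic setup would use a large raw corpus and probably a frequency cutoff.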

Results

Task	Dataset	Metric	Value	Model
Chinese	CTB6	F1	97.56	BABERT-LE
Chinese	CTB6	F1	97.45	BABERT
Chinese	PKU	F1	96.84	BABERT-LE
Chinese	PKU	F1	96.7	BABERT
Chinese	MSRA	F1	98.63	BABERT-LE
Chinese	MSRA	F1	98.44	BABERT
Chinese	MSR	F1	98.63	BABERT-LE
Chinese	MSR	F1	98.44	BABERT
