Neural CRF Model for Sentence Alignment in Text Simplification

Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, Wei Xu

2020-05-05 | ACL 2020 | Semantic Similarity, Semantic Textual Similarity, Text Simplification

Abstract

The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model that not only leverages the sequential nature of sentences in parallel documents but also uses a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of better quality than the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state of the art for text simplification in both automatic and human evaluation.
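
The alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a precomputed similarity matrix standing in for the neural sentence pair model, and a simple hand-set transition score standing in for learned CRF transition features. The names `viterbi_align`, `null_score`, and `jump_penalty` are hypothetical. Each simple sentence is labeled with the index of a complex sentence (or -1 for "no alignment"), and Viterbi decoding picks the label sequence with the highest total score, so that neighboring sentences tend to align to neighboring positions.

```python
from typing import List


def viterbi_align(sim: List[List[float]],
                  null_score: float = 0.5,
                  jump_penalty: float = 0.1) -> List[int]:
    """Align each simple sentence to one complex sentence or to -1 (none).

    sim[i][j] is an assumed similarity score between simple sentence i and
    complex sentence j. null_score and jump_penalty are illustrative
    hand-set values, not learned parameters.
    """
    n = len(sim)
    if n == 0:
        return []
    m = len(sim[0])
    states = list(range(m)) + [-1]  # -1 is the "unaligned" label

    def emit(i: int, s: int) -> float:
        # Score of giving simple sentence i the label s.
        return null_score if s == -1 else sim[i][s]

    def trans(prev: int, cur: int) -> float:
        # Penalize large jumps between consecutive aligned positions.
        if prev == -1 or cur == -1:
            return 0.0
        return -jump_penalty * abs(cur - prev)

    score = [emit(0, s) for s in states]
    back: List[List[int]] = []
    for i in range(1, n):
        new_score, ptr = [], []
        for s in states:
            # Best previous label for the current label s.
            best = max(range(len(states)),
                       key=lambda k: score[k] + trans(states[k], s))
            new_score.append(score[best] + trans(states[best], s) + emit(i, s))
            ptr.append(best)
        score = new_score
        back.append(ptr)

    # Trace back from the best final label.
    k = max(range(len(states)), key=lambda idx: score[idx])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [states[k] for k in path]


if __name__ == "__main__":
    toy_sim = [[0.9, 0.1, 0.1],
               [0.2, 0.8, 0.1],
               [0.1, 0.1, 0.2]]
    print(viterbi_align(toy_sim))  # [0, 1, -1]
```

In the toy example, the third simple sentence has no complex sentence scoring above the assumed `null_score`, so it is left unaligned. In a trained model, both the similarity scores and the transition scores would be learned rather than fixed by hand.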

Results

Task	Dataset	Metric	Value	Model
Text Simplification	Newsela	SARI	36.6	CRF Alignment + Transformer
