TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Segment Any Text: A Universal Approach for Robust, Efficie...

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, Markus Schedl

2024-06-24Sentence segmentationparameter-efficient fine-tuning
PaperPDFCodeCodeCode(official)

Abstract

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://github.com/segment-any-text/wtpsplit under the MIT license.

Related Papers

Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy2025-07-17LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization2025-07-06Exploring Adapter Design Tradeoffs for Low Resource Music Generation2025-06-26WordCon: Word-level Typography Control in Scene Text Rendering2025-06-26Optimising Language Models for Downstream Tasks: A Post-Training Perspective2025-06-26Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models2025-06-26Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models2025-06-26ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs2025-06-23