On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

Jose Camacho-Collados, Mohammad Taher Pilehvar

2017-07-06WS 2018 11Text Classification Sentiment Analysis Text Categorization Word Embeddings

Abstract

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact in its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually-overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings.

Results

Task	Dataset	Metric	Value	Model
Sentiment Analysis	SST-2 Binary classification	Accuracy	91.2	CNN
Sentiment Analysis	IMDb	Accuracy	88.9	CNN+LSTM
Text Classification	Ohsumed	Accuracy	36.2	CNN+Lowercased
Classification	Ohsumed	Accuracy	36.2	CNN+Lowercased

Related Papers

Making Language Model a Hierarchical Classifier and Generator2025-07-17 AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis2025-07-17 AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles2025-07-15 DCR: Quantifying Data Contamination in LLMs Evaluation2025-07-15 SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning2025-07-14 GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation2025-07-10 Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation2025-07-09 The Trilemma of Truth in Large Language Models2025-06-30