TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/On the Role of Text Preprocessing in Neural Network Archit...

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

Jose Camacho-Collados, Mohammad Taher Pilehvar

2017-07-06WS 2018 11Text ClassificationSentiment AnalysisText CategorizationWord Embeddings
PaperPDFCodeCodeCode(official)

Abstract

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact in its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually-overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings.

Results

TaskDatasetMetricValueModel
Sentiment AnalysisSST-2 Binary classificationAccuracy91.2CNN
Sentiment AnalysisIMDbAccuracy88.9CNN+LSTM
Text ClassificationOhsumedAccuracy36.2CNN+Lowercased
ClassificationOhsumedAccuracy36.2CNN+Lowercased

Related Papers

Making Language Model a Hierarchical Classifier and Generator2025-07-17AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis2025-07-17AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles2025-07-15DCR: Quantifying Data Contamination in LLMs Evaluation2025-07-15SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning2025-07-14GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation2025-07-10Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation2025-07-09The Trilemma of Truth in Large Language Models2025-06-30