Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Text classification with word embedding regularization and soft similarity measure

Vít Novotný, Eniafe Festus Ayetiran, Michal Štefánik, Petr Sojka

2020-03-10

Tasks: Text Classification · Word Similarity · Text Similarity · Word Embeddings · Document Classification · General Classification · Classification

Paper · PDF · Code (official)

Abstract

Since the seminal work of Mikolov et al., word embeddings have become the preferred word representations for many natural language processing tasks. Document similarity measures extracted from word embeddings, such as the soft cosine measure (SCM) and the Word Mover's Distance (WMD), were reported to achieve state-of-the-art performance on semantic text similarity and text classification. Despite the strong performance of the WMD on text classification and semantic text similarity, its super-cubic average time complexity is impractical. The SCM has quadratic worst-case time complexity, but its performance on text classification has never been compared with the WMD. Recently, two word embedding regularization techniques were shown to reduce storage and memory costs, and to improve training speed, document processing speed, and task performance on word analogy, word similarity, and semantic text similarity. However, the effect of these techniques on text classification has not yet been studied. In our work, we investigate the individual and joint effect of the two word embedding regularization techniques on the document processing speed and the task performance of the SCM and the WMD on text classification. For evaluation, we use the $k$NN classifier and six standard datasets: BBCSPORT, TWITTER, OHSUMED, REUTERS-21578, AMAZON, and 20NEWS. We show 39% average $k$NN test error reduction with regularized word embeddings compared to non-regularized word embeddings. We describe a practical procedure for deriving such regularized embeddings through Cholesky factorization. We also show that the SCM with regularized word embeddings significantly outperforms the WMD on text classification and is over 10,000 times faster.
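The two quantities at the heart of the abstract, the soft cosine measure and its Cholesky-based reformulation, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the toy similarity matrix `S` and document vectors are invented for the example, and `S` is assumed positive-definite so that the Cholesky factorization exists.

```python
import numpy as np

def soft_cosine(x, y, S):
    """Soft cosine measure between bag-of-words vectors x and y,
    where S holds word-to-word similarities (e.g. cosine
    similarities between word embeddings)."""
    return (x @ S @ y) / np.sqrt((x @ S @ x) * (y @ S @ y))

# Toy 3-word vocabulary: words 0 and 1 are similar, word 2 is unrelated.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
doc_a = np.array([1.0, 0.0, 0.0])  # document using only word 0
doc_b = np.array([0.0, 1.0, 0.0])  # document using only word 1

# Nonzero similarity despite the documents sharing no words:
print(soft_cosine(doc_a, doc_b, S))  # → 0.8

# Cholesky trick: with S = L @ L.T, the soft VSM inner product
# x^T S y becomes an ordinary dot product of transformed vectors
# (L.T @ x) and (L.T @ y), so documents can be transformed once
# up front and then compared with plain dot products.
L = np.linalg.cholesky(S)
t_a, t_b = L.T @ doc_a, L.T @ doc_b
print(np.isclose(t_a @ t_b, doc_a @ S @ doc_b))  # → True
```

The precomputed transformation is what makes the SCM cheap at query time, in contrast to the WMD, which solves an optimal-transport problem per document pair.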

Results

Task                    | Dataset       | Metric   | Value | Model
Text Classification     | 20NEWS        | Accuracy | 70.28 | Orthogonalized Soft VSM
Text Classification     | Reuters-21578 | Accuracy | 92.65 | Orthogonalized Soft VSM
Text Classification     | BBCSport      | Accuracy | 97.73 | Orthogonalized Soft VSM
Text Classification     | Twitter       | Accuracy | 69.21 | Orthogonalized Soft VSM
Text Classification     | Amazon        | Accuracy | 93.42 | Orthogonalized Soft VSM
Document Classification | Reuters-21578 | Accuracy | 92.65 | Orthogonalized Soft VSM
Document Classification | BBCSport      | Accuracy | 97.73 | Orthogonalized Soft VSM
Document Classification | Twitter       | Accuracy | 69.21 | Orthogonalized Soft VSM
Document Classification | Amazon        | Accuracy | 93.42 | Orthogonalized Soft VSM
Classification          | 20NEWS        | Accuracy | 70.28 | Orthogonalized Soft VSM
Classification          | Reuters-21578 | Accuracy | 92.65 | Orthogonalized Soft VSM
Classification          | BBCSport      | Accuracy | 97.73 | Orthogonalized Soft VSM
Classification          | Twitter       | Accuracy | 69.21 | Orthogonalized Soft VSM
Classification          | Amazon        | Accuracy | 93.42 | Orthogonalized Soft VSM

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation (2025-07-10)
Speak2Sign3D: A Multi-modal Pipeline for English Speech to American Sign Language Animation (2025-07-09)
Fuzzy Classification Aggregation for a Continuum of Agents (2025-07-06)