Efficient Vector Representation for Documents through Corruption

Minmin Chen

2017-07-08

Representation Learning, Sentiment Analysis, Word Embeddings, Document Classification

Abstract

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. During learning, it ensures that a representation generated this way captures the semantic meaning of the document. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
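
The averaging step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the embedding matrix layout (one row per vocabulary word), and the corruption scheme are all assumptions. In particular, corruption is modeled here as dropping each word independently with probability `q` and rescaling the survivors by `1 / (1 - q)` so that the corrupted average matches the plain average in expectation. The abstract itself does not spell out these details.

```python
import numpy as np


def doc_vector(word_ids, embeddings):
    """Plain document vector: the average of the embeddings of its words.

    `embeddings` is assumed to be a (vocab_size, dim) array and
    `word_ids` a sequence of integer row indices (hypothetical layout).
    """
    if len(word_ids) == 0:
        return np.zeros(embeddings.shape[1])
    return embeddings[np.asarray(word_ids)].mean(axis=0)


def corrupted_doc_vector(word_ids, embeddings, q, rng):
    """Document vector after random word removal (assumed corruption model).

    Each word is dropped independently with probability q. The sum of the
    surviving embeddings is divided by (1 - q) * T, where T is the original
    document length, so the expected value equals doc_vector(...).
    """
    if len(word_ids) == 0:
        return np.zeros(embeddings.shape[1])
    word_ids = np.asarray(word_ids)
    total = len(word_ids)
    keep = rng.random(total) >= q  # True for words that survive corruption
    summed = embeddings[word_ids[keep]].sum(axis=0)
    return summed / ((1.0 - q) * total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_embeddings = rng.normal(size=(5, 4))  # toy vocabulary of 5 words
    document = [0, 3, 3, 1]                   # word indices of one document
    print(doc_vector(document, toy_embeddings))
    print(corrupted_doc_vector(document, toy_embeddings, q=0.5, rng=rng))
```

Under these assumptions, representing an unseen document at test time needs only the lookup and the mean in `doc_vector`, which is consistent with the efficiency claim in the abstract.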

Results

Task	Dataset	Metric	Value	Model
Language Modelling	SICK	MSE	0.3053	Doc2VecC
Language Modelling	SICK	Pearson Correlation	0.8381	Doc2VecC
Language Modelling	SICK	Spearman Correlation	0.7621	Doc2VecC
Sentiment Analysis	IMDb	Accuracy	88.3	Doc2VecC
Sentence Pair Modeling	SICK	MSE	0.3053	Doc2VecC
Sentence Pair Modeling	SICK	Pearson Correlation	0.8381	Doc2VecC
Sentence Pair Modeling	SICK	Spearman Correlation	0.7621	Doc2VecC
Semantic Similarity	SICK	MSE	0.3053	Doc2VecC
Semantic Similarity	SICK	Pearson Correlation	0.8381	Doc2VecC
Semantic Similarity	SICK	Spearman Correlation	0.7621	Doc2VecC
