DocBERT: BERT for Document Classification

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, Jimmy Lin

2019-04-17

Tasks: Text Classification, Sentiment Analysis, Document Classification, General Classification, Classification

Abstract

We present, to our knowledge, the first application of BERT to document classification. A few characteristics of the task might lead one to think that BERT is not the most appropriate model: syntactic structures matter less for content categories, documents can often be longer than typical BERT input, and documents often have multiple labels. Nevertheless, we show that a straightforward classification model using BERT is able to achieve the state of the art across four popular datasets. To address the computational expense associated with BERT inference, we distill knowledge from BERT-large to small bidirectional LSTMs, reaching BERT-base parity on multiple datasets using 30x fewer parameters. The primary contribution of our paper is improved baselines that can provide the foundation for future work.
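The distillation step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the exact objective, so the form below is an assumption: a common logit-matching setup in which the student (here, the BiLSTM) is trained on the usual multi-label classification loss plus a penalty for deviating from the teacher's (BERT-large's) logits. The function names and the weight `lam` are hypothetical and chosen for illustration only.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Suits the multi-label setting: each label is an independent yes/no decision.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, teacher_logits, targets, lam=1.0):
    """Hard-label loss plus lam times a logit-matching penalty.

    ASSUMPTION: this combined form is a common distillation objective,
    not necessarily the one used in the paper.
    """
    hard = bce_with_logits(student_logits, targets)
    soft = mse(student_logits, teacher_logits)
    return hard + lam * soft


if __name__ == "__main__":
    # Toy example with three labels; all values are made up for illustration.
    student = [2.0, -1.0, 0.5]
    teacher = [3.0, -2.0, 0.0]
    labels = [1.0, 0.0, 1.0]
    print(distillation_loss(student, teacher, labels, lam=0.5))
```

When the student's logits match the teacher's exactly, the penalty term vanishes and only the hard-label loss remains; `lam` controls how strongly the student is pulled toward the teacher.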

Results

Task	Dataset	Metric	Value	Model
Text Classification	arXiv-10	Accuracy	0.764	DocBERT
Text Classification	Reuters-21578	F1	88.9	KD-LSTMreg
Text Classification	AAPD	F1	72.9	KD-LSTMreg
Text Classification	Yelp-14	Accuracy	69.4	KD-LSTMreg
Document Classification	Reuters-21578	F1	88.9	KD-LSTMreg
Document Classification	AAPD	F1	72.9	KD-LSTMreg
Document Classification	Yelp-14	Accuracy	69.4	KD-LSTMreg
Classification	arXiv-10	Accuracy	0.764	DocBERT
Classification	Reuters-21578	F1	88.9	KD-LSTMreg
Classification	AAPD	F1	72.9	KD-LSTMreg
Classification	Yelp-14	Accuracy	69.4	KD-LSTMreg
