Latent Dirichlet Allocation

David M. Blei, Andrew Y. Ng, Michael I. Jordan

2003-01-01 | Text Classification, Collaborative Filtering, Text Categorization, Topic Models

Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
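The three levels named in the abstract (corpus-level parameters, a per-document topic mixture, and per-word topic assignments) can be illustrated with a small sampler. This is a minimal sketch, not the authors' code: the function name `generate_corpus`, the fixed document length (the paper draws the length from a Poisson distribution), and the toy `alpha` and `beta` values are all assumptions made for illustration.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process.

    alpha: length-K Dirichlet parameter (corpus level).
    beta:  K x V matrix of topic-word probabilities, each row summing to 1
           (corpus level).
    Assumption: every document has the same fixed length `doc_len`.
    """
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    n_topics, vocab_size = beta.shape

    docs, thetas = [], []
    for _ in range(n_docs):
        # Document level: topic proportions theta ~ Dirichlet(alpha).
        theta = rng.dirichlet(alpha)
        # Word level: a topic z_n ~ Multinomial(theta) for each word slot.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Word level: each word w_n is drawn from its topic's row of beta.
        words = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas)


# Hypothetical toy setup: 3 topics over a 4-word vocabulary.
toy_beta = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.40, 0.40],
])
docs, thetas = generate_corpus(n_docs=5, doc_len=20,
                               alpha=[0.5, 0.5, 0.5], beta=toy_beta)
```

Each row of `thetas` is the vector of topic probabilities for one document, which is the explicit document representation the abstract refers to. Inference runs in the opposite direction: given only `docs`, it estimates `thetas` and `beta`, which the paper does with variational methods and EM rather than by sampling.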
