Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Topic modeling topic coverage dataset

Texts | CC BY 4.0 | Introduced 2021-08-31

Topic discovery is a prevalent use case of topic models. However, most topic model evaluation methods rely on abstract metrics such as perplexity or topic coherence. The topic coverage approach instead measures a model's performance by matching model-generated topics to topics discovered by humans. This way, models are evaluated in the context of their use, essentially by simulating topic modeling in a fixed setting defined by a text collection and a set of reference topics.
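The matching idea can be illustrated with a small sketch. It assumes that each topic is a vector of word weights over a shared vocabulary and that a reference topic counts as covered when some model topic lies within a cosine-distance threshold. The function names, the threshold values, and the toy weights are all hypothetical; the measures actually used with this dataset are defined in the accompanying paper and may differ.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length topic-word weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def coverage(reference_topics, model_topics, threshold):
    """Fraction of reference topics matched by at least one model topic.

    Illustrative assumption: a match means cosine distance <= threshold.
    """
    if not reference_topics:
        return 0.0
    covered = sum(
        1
        for ref in reference_topics
        if any(cosine_distance(ref, m) <= threshold for m in model_topics)
    )
    return covered / len(reference_topics)


# Toy example over a 4-word vocabulary (made-up weights, not dataset values).
reference = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
model = [[0.6, 0.4, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]

strict = coverage(reference, model, threshold=0.2)
loose = coverage(reference, model, threshold=0.4)
print(strict, loose)
```

With the stricter threshold only the first reference topic finds a close model topic. The looser threshold also accepts the diffuse second model topic as a match for the second reference topic, which shows how sensitive a threshold-based coverage score is to the matching criterion.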

Reference topics represent a ground truth that can be used to evaluate both topic models and other measures of model performance. The coverage approach enables large-scale automatic evaluation of both existing and future topic models.
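One plausible way to use coverage as ground truth for another measure is to score several models with both and check whether the two rankings agree; the benchmark listing on this page names Spearman correlation as its metric. The sketch below computes Spearman's rank correlation in plain Python. The scores are invented for illustration, and the exact evaluation protocol behind the benchmarks is an assumption here, not something this page specifies.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for four topic models (not taken from the dataset).
coverage_scores = [0.10, 0.35, 0.50, 0.80]
other_measure = [0.20, 0.40, 0.30, 0.90]

rho = spearman(coverage_scores, other_measure)
print(round(rho, 3))
```

A value of rho close to 1 would suggest that the other measure orders models in much the same way as coverage does.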

The topic coverage dataset consists of two text collections and two sets of reference topics. These two sub-datasets correspond to two domains (news text and biological text) where topic models are used for topic discovery in large text collections. The reference topics consist of model-generated topics inspected, selected, and curated by humans.

Each sub-dataset contains a corpus of preprocessed (tokenized) texts and a set of reference topics, each represented by a list of words and text documents. The dataset details, including instructions for using the data and the supporting code, are available at: https://github.com/dkorenci/topic_coverage/blob/main/data.readme.txt

The coverage measures that can be used to evaluate topic models are described in the accompanying paper; the code and instructions can be found in the GitHub repository.

Benchmarks

Classification/Spearman Correlation
Text Classification/Spearman Correlation
Topic Models/Spearman Correlation

Related Benchmarks

Topic modeling topic coverage dataset - bio/Classification/AuCDC
Topic modeling topic coverage dataset - bio/Classification/SupCov
Topic modeling topic coverage dataset - bio/Text Classification/AuCDC
Topic modeling topic coverage dataset - bio/Text Classification/SupCov
Topic modeling topic coverage dataset - bio/Topic Models/AuCDC
Topic modeling topic coverage dataset - bio/Topic Models/SupCov
Topic modeling topic coverage dataset - news/Classification/AuCDC
Topic modeling topic coverage dataset - news/Classification/SupCov
Topic modeling topic coverage dataset - news/Text Classification/AuCDC
Topic modeling topic coverage dataset - news/Text Classification/SupCov
Topic modeling topic coverage dataset - news/Topic Models/AuCDC
Topic modeling topic coverage dataset - news/Topic Models/SupCov

Statistics

Papers: 1
Benchmarks: 3

Tasks

Classification
Text Classification
Topic Models
Topic coverage