arXiv Categories
arXiv Categories Multi-label Text Classification Dataset
Modality: Texts | License: MIT | Introduced: 2024-10-08
This is a dataset of scientific documents derived from arXiv. It comprises 203,961 documents, each a title paired with its abstract, labeled with 130 classes from the arXiv category taxonomy. The task is multi-label: each document is assigned one or more classes. The data is split into train (163,168), validation (20,396), and test (20,397) sets, roughly an 80/10/10 split.
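As a rough illustration of the numbers and the multi-label setup described above, the following Python sketch checks that the stated split sizes add up to the stated total and shows one common way to encode a document's label set as a multi-hot vector. The helper names (`split_fractions`, `multi_hot`) and the four-class list are hypothetical and chosen only for illustration; the dataset's actual file format, field names, and label ordering are not specified here.

```python
# Split sizes and total as stated in the dataset description.
SPLITS = {"train": 163_168, "validation": 20_396, "test": 20_397}
TOTAL = 203_961


def split_fractions(splits):
    """Return each split's share of the total number of documents."""
    total = sum(splits.values())
    return {name: size / total for name, size in splits.items()}


def multi_hot(labels, classes):
    """Encode a document's label set as a 0/1 vector over `classes`."""
    index = {c: i for i, c in enumerate(classes)}
    vec = [0] * len(classes)
    for label in labels:
        vec[index[label]] = 1
    return vec


# Hypothetical 4-class subset of the arXiv taxonomy, for illustration only;
# the real dataset uses 130 classes.
EXAMPLE_CLASSES = ["cs.CL", "cs.LG", "math.ST", "stat.ML"]

# A document tagged with two categories gets two 1s in its vector.
example_vector = multi_hot(["cs.LG", "stat.ML"], EXAMPLE_CLASSES)

if __name__ == "__main__":
    print(sum(SPLITS.values()) == TOTAL)
    print({k: round(v, 3) for k, v in split_fractions(SPLITS).items()})
    print(example_vector)
```

With the full taxonomy, each document would map to a 130-dimensional vector of this kind, which is the usual target format for a multi-label classifier trained with a per-class sigmoid output.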