Datasets

285 machine learning datasets

285 dataset results

Wiki Squirrel (Wikipedia Squirrel)

The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges csv files contain the edges - nodes are indexed from 0. The features json files contain the features of articles - each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target csv contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we listed the number of nodes an edges with some other descriptive statistics.

208 papers0 benchmarksGraphs

ENZYMES

ENZYMES is a dataset of 600 protein tertiary structures obtained from the BRENDA enzyme database. The ENZYMES dataset contains 6 enzymes.

202 papers4 benchmarksGraphs

SNAP (Stanford Large Network Dataset Collection)

SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.

174 papers0 benchmarksGraphs

CLUSTER

CLUSTER is a node classification tasks generated with Stochastic Block Models, which is widely used to model communities in social networks by modulating the intra- and extra-communities connections, thereby controlling the difficulty of the task. CLUSTER aims at identifying community clusters in a semi-supervised setting.

159 papers1 benchmarksGraphs

Materials Project

The Materials Project is a collection of chemical compounds labelled with different attributes. The labelling is performed by different simulations, most of them at DFT level of theory.

157 papers2 benchmarksGraphs

PATTERN

PATTERN is a node classification tasks generated with Stochastic Block Models, which is widely used to model communities in social networks by modulating the intra- and extra-communities connections, thereby controlling the difficulty of the task. PATTERN tests the fundamental graph task of recognizing specific predetermined subgraphs.

153 papers1 benchmarksGraphs

REDDIT-BINARY

REDDIT-BINARY consists of graphs corresponding to online discussions on Reddit. In each graph, nodes represent users, and there is an edge between them if at least one of them respond to the other’s comment. There are four popular subreddits, namely, IAmA, AskReddit, TrollXChromosomes, and atheism. IAmA and AskReddit are two question/answer based subreddits, and TrollXChromosomes and atheism are two discussion-based subreddits. A graph is labeled according to whether it belongs to a question/answer-based community or a discussion-based community.

150 papers4 benchmarksGraphs

WebKB

WebKB is a dataset that includes web pages from computer science departments of various universities. 4,518 web pages are categorized into 6 imbalanced categories (Student, Faculty, Staff, Department, Course, Project). Additionally there is Other miscellanea category that is not comparable to the rest.

113 papers0 benchmarksGraphs

PTC (Predictive Toxicology Challenge)

PTC is a collection of 344 chemical compounds represented as graphs which report the carcinogenicity for rats. There are 19 node labels for each node.

111 papers2 benchmarksGraphs

Wiki-CS

Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.

111 papers1 benchmarksGraphs

NCI109

Tudataset: A collection of benchmark datasets for learning with graphs

96 papers2 benchmarksGraphs

Orkut

Orkut is a social network dataset consisting of friendship social network and ground-truth communities from Orkut.com on-line social network where users form friendship each other.

85 papers0 benchmarksGraphs

AMiner

The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.

80 papers1 benchmarksGraphs

MOSES (Molecular sets (MOSES))

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.

79 papers0 benchmarksGraphs

REDDIT-5K (REDDIT-MULTI-5K)

Reddit-5K is a relational dataset extracted from Reddit.

78 papers0 benchmarksGraphs

RadGraph (RadGraph: Extracting Clinical Entities and Relations from Radiology Reports)

RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.

78 papers0 benchmarksGraphs, Medical, Texts

Long Range Graph Benchmark (LRGB)

The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.

76 papers0 benchmarksGraphs

AIDS

AIDS is a graph dataset. It consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral Screen Database of Active Compounds. It contains 4395 chemical compounds, of which 423 belong to class CA, 1081 to CM, and the remaining compounds to CI.

67 papers5 benchmarksGraphs

Reuters-21578

The Reuters-21578 dataset is a collection of documents with news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.

66 papers12 benchmarksGraphs

Friendster

Friendster is an on-line gaming network. Before re-launching as a game website, Friendster was a social networking site where users can form friendship edge each other. Friendster social network also allows users form a group which other members can then join. The Friendster dataset consist of ground-truth communities (based on user-defined groups) and the social network from induced subgraph of the nodes that either belong to at least one community or are connected to other nodes that belong to at least one community.

65 papers0 benchmarksGraphs

PreviousPage 2 of 15Next