285 machine learning datasets
Regression dataset for molecular docking scores (predicted molecule-protein binding affinity). Contains ~250,000 molecules against 58 protein targets.
MalNet is a large public graph database, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.
Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.
SketchGraphs is a dataset of 15 million sketches extracted from real-world CAD models intended to facilitate research in both ML-aided design and geometric program induction. Each sketch is represented as a geometric constraint graph whose nodes are geometric primitives and whose edges denote designer-imposed geometric relationships between them.
This dataset is a Wikipedia dump, split by relations to perform Few-Shot Knowledge Graph Completion.
Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
Node classification on Texas with 60%/20%/20% random splits for training/validation/test.
Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.
BeerAdvocate is a dataset that consists of beer reviews from BeerAdvocate. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review.
Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.
The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents a dependency between two statements.
Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.
Mutagenicity is a chemical compound dataset of drugs, which can be categorized into two classes: mutagen and non-mutagen.
Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions.
For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.
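Several of the node classification entries above use 60%/20%/20% random splits for training/validation/test. As a rough illustration of what such a split involves, the following is a minimal sketch, not the procedure used by any particular benchmark. The function name `random_node_split`, the seed, and the node count of 1000 are hypothetical placeholders; the fixed 48%/32%/20% Geom-GCN splits are distributed as precomputed files and are not reproduced by this code.

```python
import random


def random_node_split(num_nodes, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle node indices and cut them into train/val/test lists.

    Hypothetical helper for illustration only; the test set receives
    whatever remains after the train and validation cuts.
    """
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(num_nodes))
    rng.shuffle(indices)

    n_train = round(train_frac * num_nodes)
    n_val = round(val_frac * num_nodes)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Placeholder graph size; substitute the real node count of the dataset.
train, val, test = random_node_split(1000)
print(len(train), len(val), len(test))
```

Benchmarks that report results over random splits typically repeat this with several seeds and average the scores, so the seed is exposed as a parameter here.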