Datasets

285 machine learning datasets

The Pinterest dataset contains more than 1 million images, each associated with the Pinterest users who have “pinned” it.

Email-EU

Email-EU is a directed temporal network constructed from email exchanges in a large European research institution over an 803-day period. It contains 986 email addresses as nodes and 332,334 emails as timestamped edges. The dataset includes 42 ground-truth departments.

33 papers · 0 benchmarks · Graphs
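A directed temporal network of the kind described for Email-EU can be represented as a list of timestamped edges. The sketch below is illustrative only: the `Email` record, the `summarize` helper, the toy edges, and the assumption that timestamps are in seconds are all hypothetical and not taken from the dataset's actual file format.

```python
from collections import namedtuple

# Hypothetical record type: one directed edge per email.
# Assumption: timestamps are seconds since the first recorded email.
Email = namedtuple("Email", ["sender", "recipient", "timestamp"])

def summarize(edges):
    """Count nodes and edges and compute the covered time span in days."""
    nodes = {e.sender for e in edges} | {e.recipient for e in edges}
    span_seconds = max(e.timestamp for e in edges) - min(e.timestamp for e in edges)
    return {
        "num_nodes": len(nodes),
        "num_edges": len(edges),
        "span_days": span_seconds / 86400,
    }

# Toy data, not from the dataset: three emails among three addresses over two days.
toy_edges = [Email(0, 1, 0), Email(1, 2, 86400), Email(0, 1, 172800)]
summary = summarize(toy_edges)
print(summary)
```

Repeated emails between the same pair stay as separate edges, which matches the description of emails (not address pairs) as edges.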

Decagon (Bio-decagon)

Bio-decagon is a dataset for the polypharmacy side-effect identification problem, framed as multirelational link prediction in a two-layer multimodal graph with two node types: drugs and proteins. The protein-protein interaction network describes relationships between proteins. The drug-drug interaction network contains 964 different types of edges (one for each side-effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.

33 papers · 3 benchmarks · Graphs

amazon-ratings

amazon-ratings is a product co-purchasing network based on data from the SNAP datasets collection.

33 papers · 1 benchmark · Graphs

Worldtree

Worldtree is a corpus of explanation graphs, explanatory role ratings, and an associated tablestore. It provides explanation graphs for 1,680 questions and 4,950 tablestore rows across 62 semi-structured tables. The data is intended to be paired with the AI2 Mercury Licensed questions.

32 papers · 0 benchmarks · Graphs, Texts

LDC2017T10 (Abstract Meaning Representation (AMR) Annotation Release 2.0)

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

27 papers · 2 benchmarks · Graphs, Texts

REDDIT-12K

REDDIT-12K contains 11,929 graphs, each corresponding to an online discussion thread. Nodes represent users, and an edge indicates that one of the two users responded to a comment by the other. Each graph carries one of 11 labels representing the category of the community.

24 papers · 2 benchmarks · Graphs
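The thread-to-graph construction described for REDDIT-12K (users as nodes, an edge when one user replied to the other) can be sketched roughly as follows. The `build_thread_graph` helper, the reply pairs, and the choice to skip self-replies are assumptions made for illustration; they are not the dataset's actual preprocessing.

```python
from collections import defaultdict

def build_thread_graph(replies):
    """Build an undirected adjacency map from (replier, parent_author) pairs."""
    adjacency = defaultdict(set)
    for replier, parent_author in replies:
        if replier == parent_author:
            continue  # assumption: self-replies do not create edges
        adjacency[replier].add(parent_author)
        adjacency[parent_author].add(replier)
    return dict(adjacency)

# Toy thread, not from the dataset: bob and carol reply to alice, alice replies to bob.
replies = [("bob", "alice"), ("alice", "bob"), ("carol", "alice")]
graph = build_thread_graph(replies)

# Each undirected edge appears in two neighbor sets, so halve the total.
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
print(sorted(graph), num_edges)
```

Using sets means that mutual replies between the same two users collapse into a single edge, consistent with an edge meaning "one of the two users responded to the other".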

twitch-gamers

Node classification on twitch-gamers.

24 papers · 2 benchmarks · Graphs

LastFM Asia

A social network of LastFM users that was collected from the public API in March 2020. Nodes are LastFM users from Asian countries, and edges are mutual follower relationships between them. The vertex features are extracted from the artists liked by the users. The task associated with the graph is multinomial node classification: one has to predict the location of users. This target feature was derived from the country field for each user.

21 papers · 0 benchmarks · Graphs

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

20 papers · 2 benchmarks · Graphs

GEOM-DRUGS

GEOM-DRUGS is a dataset of 430,000 large organic molecules of up to 180 atoms from Axelrod and Gómez-Bombarelli, Nature Scientific Data, 2022.

20 papers · 3 benchmarks · 3D, Graphs

AGENDA (Abstract GENeration DAtaset)

Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences.

19 papers · 3 benchmarks · Graphs

UMLS (Unified Medical Language System)

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records.

19 papers · 2 benchmarks · Graphs, Texts

Film (60%/20%/20% random splits)

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

19 papers · 1 benchmark · Graphs
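A 60%/20%/20% random split of node indices into training/validation/test sets can be sketched as below. This is a minimal illustration, not the procedure used by any particular benchmark: the `random_split` helper, the seed, and the node count of 1000 are hypothetical, and published benchmarks typically fix their own seeds or ship the split files.

```python
import random

def random_split(num_nodes, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle node indices and cut them into train/validation/test lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(fractions[0] * num_nodes)
    n_val = round(fractions[1] * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# Hypothetical graph with 1000 nodes.
train_idx, val_idx, test_idx = random_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Passing `fractions=(0.5, 0.25, 0.25)` gives a 50%/25%/25% split with the same helper.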

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

19 papers · 1 benchmark · Graphs

Deezer-Europe

Node classification on Deezer Europe with 50%/25%/25% random splits for training/validation/test.

19 papers · 1 benchmark · Graphs

Squirrel (48%/32%/20% fixed splits)

Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.

19 papers · 2 benchmarks · Graphs

Yeast

The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.

18 papers · 0 benchmarks · Biology, Graphs

PubMed (60%/20%/20% random splits)

Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.

18 papers · 1 benchmark · Graphs

Wisconsin (60%/20%/20% random splits)

Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.

18 papers · 1 benchmark · Graphs
