We have characterized 1,000 human cancer cell lines and screened them with hundreds of compounds. On this website, you will find drug response data and genomic markers of sensitivity.
This is the set of instances used in the PACE 2018 competition on optimal Steiner tree computation. The instances are grouped into three tracks of 200 instances each, except the third track, which contains 199. Each instance is an undirected graph.
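The instances are distributed as plain-text graph files; the sketch below assumes the SteinLib-style STP layout commonly used for these tracks (`E u v w` edge lines, `T v` terminal lines). The exact format and the helper name `parse_stp` are assumptions, so check the official track specification before relying on this.

```python
def parse_stp(text):
    """Parse a Steiner tree instance in the STP-style text format
    (layout is an assumption; see the PACE track specification)."""
    edges, terminals = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "E" and len(parts) == 4:
            u, v, w = map(int, parts[1:])
            edges.append((u, v, w))      # undirected weighted edge
        elif parts[0] == "T" and len(parts) == 2:
            terminals.append(int(parts[1]))
    return edges, terminals
```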
VirtualHome2KG is a system for constructing and augmenting knowledge graphs (KGs) of daily living activities using virtual space. We also provide an ontology to describe the structure of the KGs. We used VirtualHome as the virtual space simulation platform, so this repository is an extension of VirtualHome; please see the original VirtualHome repository for details of the Unity simulation.
Money laundering is a multi-billion dollar issue. Detection of laundering is very difficult. Most automated algorithms have a high false positive rate: legitimate transactions incorrectly flagged as laundering. The converse is also a major problem -- false negatives, i.e. undetected laundering transactions. Naturally, criminals work hard to cover their tracks.
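The two error types described above can be made concrete. The sketch below computes false-positive and false-negative rates from binary labels; the function name and label encoding (1 = flagged as laundering) are illustrative assumptions, not part of any released benchmark code.

```python
def error_rates(y_true, y_pred):
    """False-positive and false-negative rates for a binary
    laundering detector (1 = laundering, 0 = legitimate)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0  # legit flagged as laundering
    fnr = fn / positives if positives else 0.0  # laundering left undetected
    return fpr, fnr
```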
The OQMD (Open Quantum Materials Database) is a database of DFT-calculated thermodynamic and structural properties of one million materials, created in Chris Wolverton's group at Northwestern University.
The arXiv GR-QC (General Relativity and Quantum Cosmology) collaboration network comes from the e-print arXiv and covers scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category. If author i co-authored a paper with author j, the graph contains an undirected edge between i and j. If a paper is co-authored by k authors, this generates a completely connected (sub)graph on k nodes.
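The clique-per-paper construction rule can be sketched directly: each paper's author list contributes one fully connected subgraph. The helper name and the list-of-author-lists input shape are assumptions for illustration, not part of the released data.

```python
from itertools import combinations

def build_coauthor_graph(papers):
    """Each paper is a list of author ids; every pair of coauthors
    gets one undirected edge, so k authors yield a k-clique."""
    edges = set()
    for authors in papers:
        # sort so each undirected edge is stored once as (min, max)
        for i, j in combinations(sorted(set(authors)), 2):
            edges.add((i, j))
    return edges
```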
The FB15k-237-low dataset is a variant of FB15k-237 in which only relations with a low number of triples are kept.
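One way such a subset could be derived is by thresholding relation frequency, sketched below. The threshold semantics, the helper name, and the (head, relation, tail) triple layout are assumptions; the actual procedure used to build FB15k-237-low is not specified here.

```python
from collections import Counter

def keep_low_frequency(triples, max_count):
    """Keep only triples whose relation occurs at most max_count
    times (threshold is a hypothetical stand-in for 'low')."""
    freq = Counter(r for _, r, _ in triples)
    return [t for t in triples if freq[t[1]] <= max_count]
```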
Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and makes it possible to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as is often the case.
The Gossipcop variant of the UPFD dataset for benchmarking.
How and where proteins interface with one another can ultimately impact the proteins' functions along with a range of other biological processes. As such, precise computational methods for protein interface prediction (PIP) are highly sought after, as they could yield significant advances in drug discovery and design as well as protein function analysis. However, the traditional benchmark dataset for this task, Docking Benchmark 5 (DB5), contains only a paltry 230 complexes for training, validating, and testing different machine learning algorithms. In this work, we expand on a dataset recently introduced for this task, the Database of Interacting Protein Structures (DIPS), to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces. The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex, whereas DIPS-Plus now includes a plethora of new residue-level features.
The data was collected from the music streaming service Deezer in November 2017. These datasets represent friendship networks of users from three European countries: Romania, Croatia, and Hungary. Nodes represent users and edges are mutual friendships. We reindexed the nodes to provide a degree of anonymity. The csv files contain the edges, with nodes indexed from 0. The json files contain the genre preferences of users: each key is a user id, and the genres loved are given as lists. Genre notation is consistent across users, and in each dataset users could like 84 distinct genres. Liked-genre lists were compiled from liked-song lists. For each dataset we list the number of nodes and edges.
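A minimal loading sketch for one country, assuming an edge-list csv with a header row and a genre json as described above; the file names, column layout, and helper name are assumptions, not the dataset's documented API.

```python
import csv
import json

def load_deezer(edge_path, genre_path):
    """Load one country's friendship edges and per-user genre lists.
    File layout (header row, two integer columns) is an assumption."""
    with open(edge_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                      # skip the header row
        edges = [(int(u), int(v)) for u, v in reader]
    with open(genre_path) as f:
        # keys are user ids stored as strings; genres are lists
        genres = {int(uid): gs for uid, gs in json.load(f).items()}
    return edges, genres
```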
The DeepNets-1M dataset is composed of neural network architectures represented as graphs, where nodes are operations (convolution, pooling, etc.) and edges correspond to the forward-pass flow of data through the network. DeepNets-1M has 1 million training architectures and 1,402 in-distribution (ID) and out-of-distribution (OOD) evaluation architectures: 500 validation and 500 testing ID architectures, 100 wide OOD architectures, 100 deep OOD architectures, 100 dense OOD architectures, 100 OOD architectures without batch normalization, and 2 predefined architectures (ResNet-50 and a 12-layer Visual Transformer).
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were inspired by the UNL project to include this novel, so that different groups could compare representations on the same text.
The GlassTemp dataset is collected from PoLyInfo. It represents monomers as polymer graphs for predicting the glass transition temperature, i.e., the temperature range over which the material's glass transition takes place.
Question Answering (QA) is a widely-used framework for developing and evaluating an intelligent machine. In this light, QA on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone toward developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA.
DPB-5L is a multilingual KG dataset containing five KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for knowledge graph completion and entity alignment tasks. DPB-5L (English) is the subset of DPB-5L containing the English KG.
DPB-5L is a multilingual KG dataset containing five KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for knowledge graph completion and entity alignment tasks. DPB-5L (French) is the subset of DPB-5L containing the French KG.
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole-slide images. Cell graphs are large (over 1 million nodes per image), node features are varied (64 dimensions covering 11 cell types), class labels are imbalanced (9 classes ranging from 0.21% to 40.0% of the data), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 to 44,671 nodes for a single structure).
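With class labels this skewed (0.21% to 40.0%), inverse-frequency class weighting is a common mitigation during training. The sketch below is illustrative only and is not the benchmark's own evaluation protocol; the helper name is an assumption.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rarer classes get larger
    weights, and a perfectly balanced label set gets weight 1.0 each."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}
```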