285 machine learning datasets
285 dataset results
The AIDS Antiviral Screen dataset is a dataset of screens checking tens of thousands of compounds for evidence of anti-HIV activity. The available screen results are chemical graph-structured data of these various compounds.
The KACC benchmark consists of three subtasks that can be applied to knowledge graphs: knowledge abstraction, knowledge concretization and knowledge completion.
HAM is a dataset for molecular graph partitioning. This dataset contains coarse-grained (CG) mappings of 1206 organic molecules with less than 25 heavy atoms. Each molecule was downloaded from the PubChem database as SMILES. One molecule was assigned to two annotators to compare the human agreement between CG mappings. Downloaded SMILES were hand-mapped. The completed annotations were reviewed by a third person, to identify and remove unreasonable mappings (eg: one bead mappings) which did not agree with the given guidelines. Hence, there are 1.68 annotations per molecule in the current database (16% removed).
SidechainNet is a protein structure prediction dataset that directly extends ProteinNet. Specifically, SidechainNet adds measurements for protein angles and coordinates that describe the complete, all-atom protein structure (backbone and sidechain, excluding hydrogens) instead of the protein backbone alone.
TextWorld KG is a dynamic Knowledge Graph (KG) extraction dataset. It is based on a set of text-based games generated using. That framework allows to extract the underlying partial KG for every state, i.e., the subgraph that represents the agent’s partial knowledge of the world – what it has observed so far. All games share the same overarching theme: the agent finds itself hungry in a simple modern house with the goal of gathering ingredients and cooking a meal.
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.
This dataset is composed of paired videos of people dancing 3 different music styles: Ballet, Michael Jackson and Salsa. It contains multimodal data (visual data, temporal-graphs and audio) careful-selected from publicly available videos of dancers performing representative movements of the music style and audio data from the respective styles.
hERG is a large-scale biophysics federated molecular dataset related to cardiac toxicity. It consists of 10,572 compounds, with an average of 29.39 nodes and 94.09 edges in each graph.
JoCAD is a dataset for anomaly detection in citation networks.
HoaxItaly consists of over 1 million tweets shared during 2019 and containing links to thousands of news articles published on two classes of Italian outlets: (1) disinformation websites, i.e. outlets which have been repeatedly flagged by journalists and fact-checkers for producing low-credibility content such as false news, hoaxes, click-bait, misleading and hyper-partisan stories; (2) fact-checking websites which notably debunk and verify online news and claims. The dataset includes title and body for approximately 37k news articles.
YoutubeGraph-Dyn is an evolving graph dataset generated from YouTube real-world interactions. It can be used to study temporal evolution on graphs. YoutubeGraph-Dyn provides intra-day time granularity (with 416 snapshots taken every 6 hours for a period of 104 days), multi-modal relationships that capture different aspects of the data, multiple attributes including timestamped, non-timestamped, word embeddings, and integers.
ReviewRobot Dataset Overview This repository contains data for paper ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis. [Dataset]
Clickable heat-map visualizations of the experiments run to quantify the Classic ECN AQM problem and to evaluate the success of the Classic AQM Detection and Fall-back algorithm.
CTFW is a large annotated procedural text dataset in the cybersecurity domain (3154 documents). It is used to generate flow graphs from procedural texts.
The LSEC (Live Stream E-Commerce) dataset has two subsets: LSEC-Small and LSEC-Large. It is a dataset for studying E-commerce transactions in the context of live streams, where the streames are talking about products while interacting with their audience. The dataset consists of interaction information among streamers, users, and products.
Synthetic graph classification datasets with the task of recognizing the connectivity of same-colored nodes in 4 graphs of varying topology.
The dataset contains entities from IMDB, TheMovieDB and TheTVDB with goldstandard matches between the sources. Due to the licensing of IMDB we provide a script to build the IMDB part of the dataset yourself.
The original paper contains a high-level explanation of the dataset characteristics, and potential use cases of the dataset. ArchABM can help to quantify the impact of some of these building- and company policy-related measures.
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as ancestor-descendant prediction task proposed in the paper.