285 machine learning datasets
The AutoFR dataset is distributed as a zip file, broken down by each site that we crawled. It covers the multiple experiments that we conducted in our paper. Overall, the dataset contains the 1042 sites within the Top-5K on which we detected ads.
FLORIS farm dataset: a dataset for graph neural network modeling of wind farms. The current version of the dataset contains two farms with very different geometries but similar inter-turbine statistics. The wind farms were simulated with the steady-state wake model FLORIS.
This file contains the data and code for the publication "The Federal Reserve's Response to the Global Financial Crisis and Its Long-Term Impact: An Interrupted Time-Series Natural Experimental Analysis" by A. C. Kamkoum, 2023.
This dataset contains information on application install interactions of users in the Myket Android application market. The dataset was created for evaluating interaction prediction models, which require user and item identifiers along with the timestamps of the interactions. Hence, the dataset can be used for interaction prediction and for building recommendation systems. Furthermore, the data forms a dynamic network of interactions, so network representation learning can also be performed on its nodes, which are users and applications.
We introduce USPTO-30K, a large-scale benchmark dataset of annotated molecule images, which overcomes these limitations. It is created from pairs of images and MolFiles published by the United States Patent and Trademark Office. Each molecule was independently selected from all the available documents from 2001 to 2020. The set consists of three subsets to decouple the study of clean molecules, molecules with abbreviations, and large molecules.
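As a rough illustration of how timestamped interactions can be treated as a dynamic network, the sketch below sorts interactions by time, splits them chronologically, and builds the user-application adjacency visible before a given time. The record layout `(user_id, app_id, timestamp)`, the sample values, and all function names are assumptions made for illustration; they are not the dataset's actual schema or tooling.

```python
from collections import defaultdict

# Hypothetical records: (user_id, app_id, unix_timestamp).
# The real dataset's column names and formats may differ.
interactions = [
    ("u1", "a1", 1_600_000_300),
    ("u2", "a1", 1_600_000_100),
    ("u1", "a2", 1_600_000_200),
]


def build_temporal_edges(records):
    """Return the interactions ordered by timestamp (oldest first)."""
    return sorted(records, key=lambda r: r[2])


def chronological_split(edges, train_fraction=0.7):
    """Split time-ordered edges so that all test edges come after the training edges."""
    cut = int(len(edges) * train_fraction)
    return edges[:cut], edges[cut:]


def neighbors_before(edges, t):
    """Build the bipartite adjacency from edges strictly earlier than time t.

    Assumes user and application identifiers never collide (here they use
    distinct "u"/"a" prefixes), so both node types can share one dictionary.
    """
    adj = defaultdict(set)
    for user, app, ts in edges:
        if ts < t:
            adj[user].add(app)
            adj[app].add(user)
    return dict(adj)


edges = build_temporal_edges(interactions)
train_edges, test_edges = chronological_split(edges)
snapshot = neighbors_before(edges, 1_600_000_250)
```

Splitting by time rather than at random keeps future interactions out of the training set, which is the usual concern when evaluating interaction prediction on a dynamic network.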
The set is created using molecule SMILES retrieved from the database PubChem. Images are then generated from SMILES using the molecule drawing library RDKit. The synthetic set is augmented at multiple levels:
Dataset introduced by Xifeng Yan et al.
The IMCPT-SparseGM dataset is a new visual graph matching benchmark addressing partial matching and larger graphs, based on the novel stereo benchmark Image Matching Challenge PhotoTourism (IMC-PT) 2020. The dataset was released with the CVPR 2023 paper "Deep Learning of Partial Graph Matching via Differentiable Top-K".
This repository includes the experiment results, source code, and test data for Three Cs risk inference, using the CIRO (COVID-19 Infection Risk Ontology) and HermiT.
This repository is an extension of GEval. It contains a software evaluation framework for evaluating and comparing RDF-star graph embedding techniques. The gold standard datasets for evaluation were created from KGRC-RDF-star.
Multi-Modal Hate Speech Detection with Graph Context.
Genre annotations for movies. The file genre2movies.csv contains genre-movie tuples based on Wikidata annotations (https://www.wikidata.org/).
This dataset accompanies the paper "Learning the mechanisms of network growth" by the same authors. The dataset contains 6733 networks of size 20,000, each generated according to a different combination of three mechanisms: fitness, aging, and preferential attachment. The goal is to use machine learning to identify the combination of mechanisms that was used to create each network. The dataset includes static features from the literature and two versions of our newly developed dynamic features.
This benchmark hypergraph dataset, Twitter-HyDrug-UR, is derived from Twitter-HyDrug by HyGCL-DC. Twitter-HyDrug-UR is a real-world hypergraph dataset that describes drug trafficking on Twitter. Unlike HyGCL-DC, which targets a drug trafficking community detection task (multi-label node classification), we aim to identify drug user roles in drug trafficking activities on social media. To this end, we categorize node labels into four distinct roles: drug seller, drug buyer, drug user, and drug discussant, and each node is assigned exactly one label. Consequently, we frame the problem for Twitter-HyDrug-UR as a multi-class node classification task.
Inpatient claims, Outpatient claims and Beneficiary details of each provider.
The MAPLE benchmark, which we constructed, contains 20 datasets across 19 fields for scientific literature tagging. It is also available in a graph format, which can be used for graph mining tasks (e.g., node classification, link prediction). Refer to its homepage for more details.
HALvest-Geometric is a subset of HALvest: an academic citation network with 238,397 disambiguated authors and 18,662,037 scholarly papers.