HiAML Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 4.6k CIFAR-10 networks with an accuracy range of [91.11%, 93.44%].
Inception Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 580 CIFAR-10 networks with an accuracy range of [89.08%, 94.03%].
Two-Path Computational Graph (CG) family introduced in "GENNAPE: Towards Generalized Neural Architecture Performance Estimators", accepted to AAAI-23. Contains 6.9k CIFAR-10 networks with an accuracy range of [85.53%, 92.34%].
This is the set of graphs used in the PACE 2022 challenge for computing the Directed Feedback Vertex Set, from the Exact track. It consists of 200 labelled directed graphs, ranging in size from N=512 up to N=131072 vertices and up to 1315170 edges. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The graph labels are integers ranging from 1 to N.
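The feasibility check underlying this task is simple to state: a vertex set is a directed feedback vertex set (DFVS) iff deleting it leaves the digraph acyclic. A minimal sketch of that check, on a toy graph rather than a PACE instance:

```python
def is_acyclic(n, edges):
    """Kahn's algorithm: True iff the digraph on vertices 0..n-1 is acyclic."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = [v for v in range(n) if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen == n  # every vertex processed <=> no directed cycle

def is_dfvs(n, edges, candidate):
    """True iff removing `candidate` vertices destroys every directed cycle."""
    remaining = [(u, v) for u, v in edges
                 if u not in candidate and v not in candidate]
    return is_acyclic(n, remaining)

# 0 -> 1 -> 2 -> 0 is a cycle; removing vertex 1 breaks it.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(is_dfvs(4, edges, {1}))    # True
print(is_dfvs(4, edges, set()))  # False
```

Finding a *minimum* such set is the NP-hard part that the Exact track targets; the sketch above only verifies a candidate solution.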
This is the dataset used in the PACE 2016 challenge, Track B, which concerned computing a minimum Feedback Vertex Set. This competition focused on exact solutions, i.e. provably minimal feedback vertex sets (no heuristic solutions). It should not be confused with the PACE 2022 challenge, which focused on the directed feedback vertex set and has its own entries on PapersWithCode (exact and heuristic).
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
The dataset includes two parts corresponding to the cities of Abakan (65524 nodes, 340012 edges) and Omsk (231688 nodes, 1149492 edges). Along with the road network graph, it includes trip records represented as sequences of visited nodes (making the dataset suitable both for path-blind and path-aware settings). There are two types of target values for a regression task: real travel time and real length of a trip.
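Since trips are stored as node sequences over the road graph, the path-aware targets can be recovered by accumulating per-edge attributes along consecutive node pairs. A minimal sketch, assuming edge lengths are available as a `(u, v) -> length` mapping (a simplification of the dataset's actual edge attributes, with made-up values):

```python
# Hypothetical edge-length lookup for a tiny road graph (not real data).
edge_length = {
    (0, 1): 120.0,
    (1, 4): 85.5,
    (4, 7): 230.0,
}

def trip_length(nodes, edge_length):
    """Sum edge lengths along consecutive node pairs of a trip."""
    return sum(edge_length[(u, v)] for u, v in zip(nodes, nodes[1:]))

print(trip_length([0, 1, 4, 7], edge_length))  # 435.5
```

The same accumulation over per-edge travel times would yield the travel-time regression target; a path-blind setting would instead predict these targets from origin and destination alone.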
An RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications.
KGRC-RDF-star is an RDF-star dataset converted from KGRC-RDF, a knowledge graph dataset of novel stories.
Wyze Rule Recommendation Dataset: a large-scale dataset covering 300,000 users. Please cite [1] if you use the dataset and [2] if you reference the algorithm.
The CHILI-3K dataset is a medium-scale graph dataset (with overall >6M nodes, >49M edges) of mono-metallic oxide nanomaterials generated from 12 selected crystal types. This dataset has a narrow chemical scope focused on an interesting part of chemical space with a lot of active research.
The CHILI-100K dataset is a large-scale graph dataset (with overall >183M nodes, >1.2B edges) of nanomaterials generated from experimentally determined crystal structures. The crystal structures used in CHILI-100K are obtained from a curated subset of the Crystallography Open Database (COD) and cover a broad chemical scope, with database entries for 68 metals and 11 non-metals.
We introduce Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs from diverse domains, with nodes and edges enriched by dynamically changing text attributes and categories. To facilitate the use of DTGB, we design standardized evaluation procedures based on four real-world use cases: future link prediction, destination node retrieval, edge classification, and textual relation generation. These tasks require models to understand both dynamic graph structures and natural language, highlighting the unique challenges posed by DyTAGs.
Search Engine Optimization (SEO) attributes provide strong signals for predicting news site reliability. We introduce a novel attributed webgraph dataset with labeled news domains and their connections to outlinking and backlinking domains. Finally, we introduce and evaluate a novel graph-based algorithm for discovering previously unknown misinformation news sources.
GVLQA is the first vision-language QA dataset for general graph reasoning. It contains a base set, GVLQA-BASE, and four image-augmented subsets, GVLQA-AUGLY, GVLQA-AUGNO, GVLQA-AUGNS, and GVLQA-AUGET, whose samples correspond to those of the base set. It covers 7 graph reasoning tasks: cycle detection, connectivity, topological ordering, shortest path, maximum flow, bipartite matching, and Hamilton path. Utility: 1) evaluating the graph reasoning capabilities of VLMs or LLMs; 2) serving as a pretraining dataset that helps models acquire fundamental graph comprehension and reasoning abilities.
NBA: This is extended from a Kaggle dataset containing around 400 NBA basketball players. It provides performance statistics of players in the 2016-2017 season along with other information, e.g., nationality, age, and salary. To obtain the graph that links the NBA players together, we collect the relationships of the NBA basketball players on Twitter with its official crawling API. We binarize nationality into two categories, i.e., U.S. players and overseas players, which is used as the sensitive attribute. The classification task is to predict whether a player's salary is over the median.
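The two binarizations described above are straightforward to sketch. The field names (`country`, `salary`) and rows below are illustrative assumptions, not the dataset's actual schema:

```python
from statistics import median

# Made-up player records for illustration only.
players = [
    {"name": "A", "country": "USA",    "salary": 2_500_000},
    {"name": "B", "country": "Spain",  "salary": 12_000_000},
    {"name": "C", "country": "USA",    "salary": 900_000},
    {"name": "D", "country": "France", "salary": 4_000_000},
]

# Sensitive attribute: 1 for U.S. players, 0 for overseas players.
sensitive = [1 if p["country"] == "USA" else 0 for p in players]

# Classification target: 1 if the player's salary exceeds the median salary.
med = median(p["salary"] for p in players)
labels = [1 if p["salary"] > med else 0 for p in players]

print(sensitive)  # [1, 0, 1, 0]
print(labels)     # [0, 1, 0, 1]
```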
This data was collected by performing a breadth-first search on the user-product-review graph until termination, meaning that it is a fairly comprehensive collection of English-language product data. We split the full dataset into top-level categories, e.g. Books, Movies, Music. We do this mainly for practical reasons, as it allows each model and dataset to fit in memory on a single machine (requiring around 64GB RAM and 2-3 days to run our largest experiment). Note that splitting the data in this way has little impact on performance, as there are few links that cross top-level categories, and the hierarchical nature of our model means that few parameters are shared across categories.
In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task further contains three subtasks, with each subtask comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction tuning dataset proposed for solving challenges raised by TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-turbo.
ApisTox contains molecules in SMILES format for predicting pesticide toxicity to honey bees.