285 machine learning datasets
The Modified Swiss Dwellings (MSD) dataset is an ML-ready dataset for floor plan generation and analysis at building-level scale. The MSD dataset is derived entirely from the Swiss Dwellings database (v3.0.0). It contains 5,372 highly detailed floor plans of single- and multi-unit building complexes across Switzerland, extending the building scale beyond that of other well-known floor plan datasets such as RPLAN.
The Room environment - v2
We introduce WikiOFGraph, a novel large-scale, domain-diverse dataset synthesized by LLMs, ensuring superior graph-text consistency to advance general-domain graph-to-text generation.
The Perfume Co-Preference Network dataset comprises comprehensive user reviews and ratings collected from the Persian retail platform Atrafshan. This dataset, central to our research on community detection in fragrance preferences, includes 36,434 comments from 7,387 unique users, providing insights into consumer sentiment towards various perfumes. It is designed to facilitate the analysis of user preferences through sentiment analysis, allowing for the clustering of perfumes based on shared attributes.
Abstract: Graph Neural Networks (GNNs) have recently gained traction in transportation, bioinformatics, language and image processing, but research on their application to supply chain management remains limited. Supply chains are inherently graph-like, making them ideal for GNN methodologies, which can optimize and solve complex problems. The barriers include a lack of proper conceptual foundations, familiarity with graph applications in SCM, and real-world benchmark datasets for GNN-based supply chain research. To address this, we discuss and connect supply chains with graph structures for effective GNN application, providing detailed formulations, examples, mathematical definitions, and task guidelines. Additionally, we present a multi-perspective real-world benchmark dataset from a leading FMCG company in Bangladesh, focusing on supply chain planning. We discuss various supply chain tasks using GNNs and benchmark several state-of-the-art models on homogeneous and heterogeneous grap
This repository contains three graph datasets for the User Equilibrium (UE) traffic assignment problem on the Sioux-Falls, Eastern-Massachusetts, and Anaheim networks, in both DGL and PyG formats. The datasets are generated and used to train and evaluate models for solving the UE problem on these three transportation networks.
FinDKG: The Global Financial Dynamic Knowledge Graph Dataset. FinDKG is an open-source dataset focused on creating a temporally-resolved Financial Dynamic Knowledge Graph. Designed to bridge the gap in industry-specific knowledge graphs, particularly in the financial sector, FinDKG provides a high-touch, temporally-aware representation of global economic and market dynamics. This repository includes comprehensive details about the dataset, methodology, and schema, aiming to facilitate academic research and actionable insights in global financial markets.
Source: Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records
CompMix-IR Dataset Overview:
The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, specifically those labeled with the english-please tag. This label indicates that the original submission was written in a language other than English, providing a clear signal for multilingual content. The dataset spans a five-year period (March 2019 to June 2024), ensuring a diverse representation of bug types, user environments, and technical contexts.
This repository contains data for a research project involving graph neural networks (GNNs) applied to mechanical metamaterials and their deformations.
GeoJEPAD is a multimodal dataset combining OpenStreetMap (OSM) data (attributes and geometries) with high-resolution aerial imagery from diverse urban areas. The data are sourced from NAIP and OSM, then processed, tiled, and cropped. Geometries and relations are represented as graphs with optional visibility edges.
The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the metadata is standardised in the form of a knowledge graph using the Brick schema.
This repository contains documentation for the dataset that accompanies our ICPE 2025 paper, "Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads". It also includes example R and Python notebooks to read and visualize the data, including scripts to reproduce the figures and analysis results in the paper.
ATC-GRAPH is the most extensive ATC benchmark dataset. All drugs in the benchmarks are linked to their Mol files instead of the SMILES sequences utilized in earlier benchmarks. This shift allows for more precise and detailed modeling and learning. In terms of scale, ATC-GRAPH surpasses Chen-2012 and ATC-SMILES by 36.78% and 16.85%, respectively. Significantly, ATC-GRAPH was curated through a cross-validation process involving multiple resources such as KEGG, PubChem, ChEMBL, ChemSpider, and ChemicalBook. This results in ATC-GRAPH being distinguished by its timeliness and comprehensive coverage across all five levels and drug genres.
This dataset builds upon the SpaGBOL dataset - a graph-based dataset covering numerous cities across the globe for the purpose of structured city-scale Cross-View Geo-Localisation (CVGL).
B-XAIC consists of 50K small molecules represented as graphs and includes 7 graph classification tasks, each with ground truth labels and corresponding explanations.
This paper constructs 7-digit product Supply-Use Tables (SUTs) and symmetric Input-Output Tables (IOTs) for the Indian economy using microdata from the Annual Survey of Industries (ASI) for the period 2016-2021. We outline the methodology for generating input flows and reconciling registered and unregistered sector data via NPCMS-NIC concordance. The transition from SUTs to IOTs is explained using the Industry Technology Assumption. We apply this framework to analyse the economic impact, specifically Domestic Value Added (DVA) and employment influenced by production and exports. A case study of India's mobile phone sector reveals significant output growth, import substitution, an increase in exports, a shift in DVA/FVA shares, notable employment growth with a leaning towards contractual labour, and increased female participation. These tables are valuable for analysing sectoral interdependencies and industrial policy effectiveness in India.
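For readers unfamiliar with the Industry Technology Assumption mentioned in the description above, the following is a minimal sketch of its textbook product-by-product form, not the paper's own procedure. It assumes a use matrix U (products by industries) and a make matrix V (industries by products), and computes the coefficient matrix as A = B D, with B = U diag(g)^-1 and D = V diag(q)^-1. The function name and the toy numbers are hypothetical and chosen only for illustration.

```python
def industry_technology_iot(use, make):
    """Product-by-product IOT under the industry technology assumption.

    use[i][j]  : amount of product i used as input by industry j
    make[j][k] : amount of product k produced by industry j
    Returns (A, Z): technical coefficients and intermediate transactions.
    """
    n_products = len(use)
    n_industries = len(make)

    # Total output per industry (g) and per product (q).
    g = [sum(make[j]) for j in range(n_industries)]
    q = [sum(make[j][k] for j in range(n_industries)) for k in range(n_products)]

    # B: input of product i per unit of output of industry j.
    B = [[use[i][j] / g[j] for j in range(n_industries)] for i in range(n_products)]
    # D: market share of industry j in the total output of product k.
    D = [[make[j][k] / q[k] for k in range(n_products)] for j in range(n_industries)]

    # A = B D: each product inherits the input structures of the industries
    # that make it, weighted by their market shares.
    A = [[sum(B[i][j] * D[j][k] for j in range(n_industries))
          for k in range(n_products)] for i in range(n_products)]
    # Z = A diag(q): intermediate flows between products.
    Z = [[A[i][k] * q[k] for k in range(n_products)] for i in range(n_products)]
    return A, Z


# Hypothetical two-product, two-industry toy tables (not from the dataset).
toy_use = [[10.0, 20.0], [30.0, 40.0]]
toy_make = [[90.0, 10.0], [20.0, 180.0]]
toy_A, toy_Z = industry_technology_iot(toy_use, toy_make)
for row in toy_Z:
    print([round(x, 2) for x in row])
```

A useful sanity check on this formulation is that each row of Z sums to the corresponding row of the use matrix, so total intermediate use of each product is preserved, and that Z equals the use matrix when no industry has secondary production.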
DPB-5L is a multilingual knowledge graph (KG) dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment tasks. DPB-5L (Japanese) is the subset of DPB-5L containing the Japanese KG.