285 machine learning datasets
285 dataset results
In this work, we propose a novel remote sensing dataset, FireRisk, consisting of 7 fire risk classes with a total of 91 872 labelled images for fire risk assessment. This remote sensing dataset is labelled with the fire risk classes supplied by the Wildfire Hazard Potential (WHP) raster dataset, and remote sensing images are collected using the National Agriculture Imagery Program (NAIP), a high-resolution remote sensing imagery program. On FireRisk, we present benchmark performance for supervised and self-supervised representations, with Masked Autoencoders (MAE) pre-trained on ImageNet1k achieving the highest classification accuracy, 65.29%.
This package provides utilities for generation, filtering, solving, visualizing, and processing of mazes for training ML systems. Primarily built for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498
Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graphlike in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problem using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact
Cross-View Geo-Localisation within urban regions is challenging in part due to the lack of geo-spatial structuring within current datasets and techniques. We propose utilising graph representations to model sequences of local observations and the connectivity of the target location. Modelling as a graph enables generating previously unseen sequences by sampling with new parameter configurations. SpaGBOL contains 98,855 panoramic streetview images across different seasons, and 19,771 corresponding satellite images from 10 mostly densely populated international cities. This translates to 5 panoramic images and one satellite image per graph node. Downloading instructions below.
Border Gateway Protocol (BGP) Network describes the Internet's inter-domain structure, where nodes represent the autonomous systems and edges are the business relationships between nodes. The features contain basic properties, e.g., the location and topology information (e.g., transit degree), and labels means the types of autonomous systems.
The CTU Relational Learning Repository offers relational database datasets to the machine learning community. It currently hosts 148 SQL databases on a public MySQL server. A searchable meta-database provides key metadata, such as the number of tables, rows, columns, and self-relationships within each database.
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale, realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establi
PointPattern is a graph classification dataset constructed by simple point patterns from statistical mechanics. The authors simulated three point patterns in 2D: hard disks in equilibrium (HD), Poisson point process, and random sequential adsorption (RSA) of disks. The HD and Poisson distributions can be seen as simple models that describe the microstructures of liquids and gases while the RSA is a nonequilibrium stochastic process that introduces new particles one by one subject to nonoverlapping conditions.
The Toulouse Road Network dataset describes patches of road maps from the city of Toulouse, represented both as spatial graphs G = (A, X) and as grayscale segmentation images.
This is a Twitter dataset of 100,386 users along with up to 200 tweets from their timelines with a random-walk-based crawler on the retweet graph, with a subsample of 4,972 which is manually annotated as hateful or not through crowdsourcing. The dataset can be used to examine the difference between user activity patterns, the content disseminated between hateful and normal users, and network centrality measurements in the sampled graph.
The PolitiFact variant of the UPFD dataset for benchmarking.
Rent3D++ is an extension of the Rent3D floorplans + photos dataset. The floorplans are annotated with room outline polygons, doors/windows as line segments, object-icons as axis-aligned bounding boxes, room-door-room connectivity graphs, and photo-room assignments. We have extracted rectified surface crops from architectural surfaces in photos, and these can drive interior texturing/material modeling tasks. This dataset can be used with our paper Plan2Scene to generate textured 3D mesh models of houses using floorplans and photos.
DPPIN is a collection of dynamic networks, which consists of twelve generated dynamic protein-protein interaction networks of yeast cells, stored in twelve folders.
MetaVD is a Meta Video Dataset for enhancing human action recognition datasets. It provides human-annotated relationship labels between action classes across human action recognition datasets. MetaVD is proposed in the following paper: Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi. "MetaVD: A Meta Video Dataset for enhancing human action recognition datasets." Computer Vision and Image Understanding 212 (2021): 103276. [link]
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
The Nations dataset is a small knowledge graph with 14 entities, 55 relations, and 1992 triples describing countries and their political relationships. This dataset is available for download from https://github.com/ZhenfengLei/KGDatasets.
From Schaub, Michael T., et al. "Random walks on simplicial complexes and the normalized hodge 1-laplacian." SIAM Review 62.2 (2020): 353-391.
Biographical is a semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed using User-Generated Content data collected from Flickr social media platform in three global cities containing UNESCO World Heritage property (Amsterdam, Suzhou, Venice). The motivation of data collection in this project is to provide datasets that could be both directly applicable for ML communities as test-bed, and theoretically informative for heritage and urban scholars to draw conclusions on for planning decision-making.
The raw data are obtained from an industrial plant for ultra-processed food production. The sampling was carried out every 5 minutes while the total production cycle takes approximately 3 hours, from raw ingredients to final semi- finished products. The extracted data represent approximately 80 days of production. Variables 2 − 14 belonging to 4 specific phases of the process and influence the qualitative variable 17. Variables 15 and 16 are external variables not controlled by the process which affect the final product. It should also be noted that some variation may be due to changes in raw materials, in production flow (variable 1) or to possible reconfiguration between weeks. However while the magnitude of effects may change between weeks, the causal relationships are dictated by the plant and process dynamics and are consistent (at the best of potential un-cofounder and faults) throughout the production .