3,275 machine learning datasets
We present two multi-modal datasets, one for Main Board IPOs and the other for Small and Medium Enterprise (SME) IPOs. Each consists of various features relating to the company going for an IPO, along with other macroeconomic factors. The objective is to estimate the direction and underpricing with respect to the opening, high, and closing prices of the stock on the IPO listing day.
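A minimal sketch of how listing-day underpricing targets of this kind are commonly derived; the column names (issue_price, open, high, close) are assumed for illustration and are not necessarily the datasets' actual schema.

```python
import pandas as pd

# Toy listing-day prices; real values would come from the dataset.
df = pd.DataFrame({
    "issue_price": [100.0, 250.0],
    "open": [112.0, 240.0],
    "high": [118.0, 255.0],
    "close": [109.0, 245.0],
})

# Underpricing relative to each listing-day reference price, as a fraction
# of the issue price; positive values indicate the IPO was underpriced.
for ref in ("open", "high", "close"):
    df[f"underpricing_{ref}"] = (df[ref] - df["issue_price"]) / df["issue_price"]
    df[f"direction_{ref}"] = (df[f"underpricing_{ref}"] > 0).astype(int)

print(df.round(3))
```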
Two versions of the dataset are offered: the full dataset used to train the models in our paper, and a mini dataset for easier examination. Both versions include raw and post-processed subsets of peeling, wiping, and lifting. The raw videos of the tactile dataset used to generate the PCA embedding are also provided.
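A minimal sketch of deriving a PCA embedding from raw video frames, in the spirit of the tactile embedding mentioned above; the frame shapes and component count are placeholders, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for grayscale tactile frames: (n_frames, height, width).
frames = rng.random((200, 64, 64))

# Flatten each frame into a feature vector and project to a low-dim space.
X = frames.reshape(len(frames), -1)
pca = PCA(n_components=8)
embedding = pca.fit_transform(X)
print(embedding.shape)  # (200, 8)
```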
FBIS-22M is the largest field boundary instance segmentation dataset to date, featuring over 22 million labeled field instances across more than 672,000 high-resolution satellite image patches. It includes imagery from 0.25 m to 10 m resolution, sourced from multiple satellites and covering diverse geographic regions, enabling robust training for scalable agricultural vision models.
SemanticSugarBeets, a novel and high-quality dataset containing 953 monocular RGB images and 2920 annotations of sugar beets, enables a wide range of learning tasks including object detection, semantic segmentation, instance segmentation and mass estimation for post-harvest and post-storage analysis.
This repository contains documentation for the dataset that accompanies our ICPE 2025 paper, "Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads". It also includes example R and Python notebooks to read and visualize the data, including scripts to reproduce the figures and analysis results in the paper.
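A minimal Python sketch in the spirit of the repository's example notebooks; the columns and values here are hypothetical stand-ins for the actual data layout documented in the repository.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the repository's actual workload data files.
usage = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=96, freq="h"),
    "vcpu_hours": range(96),
})

# Plot compute usage over time, as the example notebooks do for the paper's figures.
usage.set_index("timestamp")["vcpu_hours"].plot(title="Compute usage over time")
plt.tight_layout()
plt.show()
```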
Smartphone cameras are ubiquitous in daily life, yet their performance can be severely impacted by dirty lenses, leading to degraded image quality. This issue is often overlooked in image restoration research, which assumes ideal or controlled lens conditions. To address this gap, we introduce SIDL (Smartphone Images with Dirty Lenses), a novel dataset designed for restoring images captured through contaminated smartphone lenses. SIDL contains diverse real-world images taken under various lighting conditions and environments. These images feature a wide range of lens contaminants, including water drops, fingerprints, and dust. Each contaminated image is paired with a clean reference image, enabling supervised learning approaches for restoration tasks. To evaluate the challenge posed by SIDL, various state-of-the-art restoration models were trained and compared on this dataset. They achieved some level of restoration but did not adequately address the diverse and realistic contaminations in SIDL.
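A minimal sketch of a paired (contaminated, clean) dataset class for supervised restoration training of the kind SIDL enables; the directory layout and file format are assumptions, not SIDL's published structure.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class PairedRestorationDataset(Dataset):
    """Pairs each contaminated image with its clean reference by filename order."""

    def __init__(self, root: str):
        self.dirty = sorted(Path(root, "dirty").glob("*.png"))   # assumed layout
        self.clean = sorted(Path(root, "clean").glob("*.png"))   # assumed layout
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.dirty)

    def __getitem__(self, i):
        x = self.to_tensor(Image.open(self.dirty[i]).convert("RGB"))
        y = self.to_tensor(Image.open(self.clean[i]).convert("RGB"))
        return x, y  # (input, supervision target)
```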
Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions.
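A minimal sketch of the hierarchical label structure and of how a sharing statistic like "22.7% of L2 actions are shared between L1 activities" can be computed; the labels below are illustrative only, not DARai's taxonomy.

```python
from collections import defaultdict

# (L1 activity, L2 action) pairs as they might appear in annotations.
annotations = [
    ("cooking", "open_cabinet"), ("cooking", "stir"),
    ("cleaning", "open_cabinet"), ("cleaning", "wipe_surface"),
]

# Which L1 activities does each L2 action occur under?
activities_per_action = defaultdict(set)
for l1, l2 in annotations:
    activities_per_action[l2].add(l1)

# An L2 action is "shared" if it appears under more than one L1 activity.
shared = [a for a, acts in activities_per_action.items() if len(acts) > 1]
print(f"{len(shared) / len(activities_per_action):.1%} of L2 actions are shared")
```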
The mwBTFreddy dataset is a resource developed to support flash flood damage assessment in urban Malawi, specifically focusing on the impacts of Cyclone Freddy in 2023. The dataset comprises paired pre- and post-disaster satellite images sourced from Google Earth Pro, accompanied by JSON files containing labelled building annotations with geographic coordinates and damage levels (no damage, minor, major, or destroyed). Developed by the Kuyesera AI Lab at the Malawi University of Business and Applied Sciences, this dataset is intended to facilitate the development of machine learning models tailored to building detection and damage classification in African urban contexts. It also supports flood damage visualisation and spatial analysis to inform decisions on relocation, infrastructure planning, and emergency response in climate-vulnerable regions.
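A minimal sketch of reading building annotations of the kind described (geographic coordinates plus a damage level); the JSON schema shown is an assumption for illustration, not the dataset's documented format.

```python
import json
from collections import Counter

record = json.loads("""
{"buildings": [
  {"lat": -15.79, "lon": 35.01, "damage": "minor"},
  {"lat": -15.80, "lon": 35.02, "damage": "destroyed"}
]}
""")

# Tally damage levels across annotated buildings.
counts = Counter(b["damage"] for b in record["buildings"])
print(counts)  # Counter({'minor': 1, 'destroyed': 1})
```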
This dataset builds upon the SpaGBOL dataset, a graph-based dataset covering numerous cities across the globe for structured city-scale Cross-View Geo-Localisation (CVGL).
The AneuX morphology database includes data from three different sources: AneuX, @neurIST, and Aneurisk. The AneuX data consists of two portions, AneuX1 and AneuX2, which were extracted by two different teams of data curators.
This dataset includes 3D point-cloud and 2D imagery from a flash LiDAR...
A small-scale real-world dataset containing hazy/dusty industrial images and their clean ground truth counterparts. Designed for evaluating deep learning models for dust removal and image dehazing in industrial environments. Collected and fine-tuned by Moshtaghioun et al., 2025.
The CAD-EdgeTune dataset is acquired using a Husarion ROSbot 2.0 and ROSbot 2.0 Pro, with the collection speed set to 5 frames per second, in a suburban university environment. We split the data into subgroups for noon, dusk, and dawn in order to depict the surroundings under various lighting conditions. We assembled 17 sequences totaling 8,080 frames, of which 1,619 have been manually annotated using an open-source pixel annotation program. Since nearby frames are highly similar to one another, we annotate only every fifth image. Because the annotation procedure can be highly time-consuming, we employ soft labeling while annotating the CAD-EdgeTune dataset, which enables us to proceed through the frames more quickly. The annotation method enables us to place small annotations inside an image's objects, with the classification extended to cover the related pixels. This approach may result in less-than-perfect annotations and some loss of accuracy, but the loss is offset by the substantial reduction in annotation time.
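A minimal sketch of the frame-subsampling idea: because consecutive frames are highly similar, only every fifth frame is selected for manual annotation. The frame count mirrors the dataset's 8,080 frames; the indexing scheme itself is an assumption.

```python
frames = list(range(8080))
to_annotate = frames[::5]  # keep every fifth frame for manual annotation
print(len(to_annotate))    # 1616, close to the 1,619 frames annotated
```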
Dataset Overview: 998 images and 4,208 annotations focusing on interaction with in-vehicle infotainment (IVI) systems.
A dataset consisting of high-quality, synthetic chest X-rays from Sana (0.6B), the leading model in the CheXGenBench benchmark. The dataset has been filtered with HealthGPT to retain only high-quality samples.
On Sunday, August 29, 2021, Hurricane Ida struck parts of Louisiana and Mississippi with wind gusts reaching up to 172 mph, leaving more than a million customers without electricity, including the entire New Orleans area. During the disaster, Maxar captured high-spatial-resolution satellite imagery (at 0.4 m/pixel), which was subsequently made publicly available. The original images were segmented into 512×512-pixel patches to maintain spatial context while enabling detailed analysis. From this process, we generated a dataset of 2,135 triplets, each containing a pre-disaster image, a post-disaster image, and a manually annotated damage categorical mask.
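A minimal sketch of tiling a large satellite scene into non-overlapping 512×512-pixel patches, as done to build the triplets; the array shape below is a placeholder, not the original Maxar imagery dimensions.

```python
import numpy as np

image = np.zeros((2048, 3072, 3), dtype=np.uint8)  # stand-in scene
patch = 512

# Slide a non-overlapping 512x512 window over the scene.
patches = [
    image[r:r + patch, c:c + patch]
    for r in range(0, image.shape[0] - patch + 1, patch)
    for c in range(0, image.shape[1] - patch + 1, patch)
]
print(len(patches))  # 4 rows x 6 cols = 24 patches for this stand-in scene
```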
The first everyday-task dataset featuring chain-of-thought (CoT) outputs, diverse task designs, and detailed re-planning processes, along with SFT and DPO sub-datasets.