Datasets

298 machine learning datasets

Songdo Traffic (Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a Large-Scale Study in a Smart City)

The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured through high-altitude bird's-eye view (BeV) drone footage over Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, this resource represents one of the most extensive aerial traffic datasets publicly available, distinguishing itself through exceptional temporal resolution that captures vehicle movements at 29.97 points per second, enabling unprecedented granularity for advanced urban mobility analysis.

1 paper | 0 benchmarks | Images, Tabular, Time series, Tracking, Videos
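To make the Songdo Traffic sampling rate concrete: at 29.97 points per second, consecutive trajectory points are roughly 33.4 ms apart. The sketch below estimates per-step speed from consecutive positions. It assumes positions have already been projected to planar coordinates in metres; the function name and the sample track are hypothetical and are not taken from the dataset's own tooling.

```python
import math

SAMPLE_RATE_HZ = 29.97  # points per second, as stated in the dataset description


def speeds_m_per_s(points, rate_hz=SAMPLE_RATE_HZ):
    """Speed between consecutive (x, y) points, assumed to be in metres."""
    dt = 1.0 / rate_hz  # time between two samples, in seconds
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]


# Hypothetical track: a vehicle moving 0.5 m per sample along the x axis.
track = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
print(speeds_m_per_s(track))
```

A track with n points yields n - 1 speed values, so a single-point track returns an empty list.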

SKF-BLS Dataset (SKF Heterogeneous Test-rig Bearing Load Sensing Dataset)

This dataset is developed to estimate bearing loads under various operating conditions (rotational speed, axial and radial loads) using data from temperature and vibration sensors. These sensor modalities provide complementary information: vibration signals indicate the magnitude of the load, while temperature measurements reveal the spatial distribution of the load within the bearing. The dataset emulates a real-world deployment scenario of a virtual sensor, mirroring scenarios where a physical sensor's operational life is limited, such as when a sensor roller collecting field data experiences premature battery depletion. It contains 55 unique operating conditions, defined by axial load (Fx), radial load (Fy), and rotational speed.

1 paper | 0 benchmarks | Time series

BTS (Building Timeseries Dataset: Empowering Large-Scale Building Analytics)

The Building TimeSeries (BTS) dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the metadata is standardised in the form of a knowledge graph using the Brick schema.

1 paper | 0 benchmarks | Graphs, Time series

Gaze-CIFAR-10

We construct Gaze-CIFAR-10, a gaze-augmented image dataset based on the standard CIFAR-10 benchmark, enhanced with human eye-tracking annotations collected using the HTC VIVE Pro Eye headset. The original CIFAR-10 dataset consists of 60,000 color images across 10 categories, each with a resolution of $32 \times 32$ pixels. To enable reliable human gaze tracking, all images are upsampled to $1024 \times 1024$ using the Real-ESRGAN model.

1 paper | 1 benchmark | Images, Time series, Tracking

RoBo6

The dataset contains light curves of 6 rocket body types from the Mini Mega Tortora (MMT) database[^1]. It was created as a benchmark for rocket body light curve classification. For more information, see the original paper: RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification[^2].

1 paper | 0 benchmarks | Time series

DPLink-ISP-Shanghai

Dataset from the WWW 2019 paper "DPLink: User Identity Linkage via Deep Neural Network From Heterogeneous Mobility Data". This data is intended for academic use only. Redistribution of this data is not permitted without our explicit permission.

1 paper | 0 benchmarks | Time series

Discrete-Time Modeling of Interturn Short Circuits in Interior PMSMs - Data and Models

Project: Discrete-Time Modeling of Interturn Short Circuits in Interior PMSMs

1 paper | 0 benchmarks | Time series

Shaved Ice Snowflake VM Demand Dataset (Snowflake Dataset for "Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads" paper)

This repository contains documentation for the dataset that accompanies our ICPE 2025 paper, "Shaved Ice: Optimal Compute Resource Commitments for Dynamic Multi-Cloud Workloads". It also includes example R and Python notebooks to read and visualize the data, including scripts to reproduce the figures and analysis results in the paper.

1 paper | 0 benchmarks | Graphs, Images, Time series

DARai (Daily Activity Recordings for AI and ML applications)

Daily Activity Recordings for Artificial Intelligence (DARai, pronounced "Dahr-ree") is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and a gaze tracker. To capture the complexity of human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3

1 paper | 0 benchmarks | Biomedical, Environment, Images, LiDAR, RGB-D, Time series, Videos

MODIS AOD (imputed) (Pre-processed MODIS AOD and ERA5 data (2003-2022) for North Africa)

Structured atmospheric data for AI/ML: long-term, pre-processed atmospheric datasets for use in machine learning/AI-based forecasting. The data was initially intended for predicting AOD, but it can be adapted to predict other atmospheric particles.

1 paper | 0 benchmarks | 3D, Environment, Physics, Time series

CSTS (Correlation Structures in Time Series)

CSTS is a comprehensive synthetic benchmarking dataset designed specifically for evaluating correlation structure discovery in time series data. The dataset systematically models known correlation structures between time series variables and enables rigorous assessment of clustering algorithms and validation methods.

1 paper | 0 benchmarks | Tabular, Time series

BOOM (Benchmark of Observability Metrics)

BOOM (Benchmark of Observability Metrics) is a large-scale, real-world time series dataset designed for evaluating models on forecasting tasks in complex observability environments. Composed of real-world metrics data collected from Datadog, a leading observability platform, the benchmark captures the irregularity, structural complexity, and heavy-tailed statistics typical of production observability data. Unlike synthetic or curated benchmarks, BOOM reflects the full diversity and unpredictability of operational signals observed in distributed systems, covering infrastructure, networking, databases, security, and application-level metrics.

1 paper | 0 benchmarks | Time series

TimeGraph (TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery)

TimeGraph is a comprehensive suite of synthetic datasets designed to benchmark causal discovery algorithms on time-series data. The dataset captures real-world complexities by incorporating temporal dynamics such as trends, seasonality, and nonstationarity, as well as sampling challenges including irregular time intervals and structured missingness. It features diverse noise types, including Gaussian, heavy-tailed, and heteroskedastic variations, and supports scenarios with latent confounding to enable evaluation under partially observed systems. The underlying causal structures span both linear and nonlinear relationships, including polynomial and trigonometric forms.

1 paper | 0 benchmarks | Tabular, Time series

Interaction Dataset of Autonomous Vehicles with Traffic Lights and Signs (Interaction Data of Autonomous Vehicles with Traffic Lights and Signs Based on Waymo Motion Open Dataset)

This dataset is derived from the Waymo Motion dataset and focuses on capturing the interactions between autonomous vehicles (AVs) and traffic control devices such as traffic lights and stop signs. It addresses a critical gap by providing real-world trajectory data that reflects how AVs interpret and respond to traffic control signals, supporting research in AV behavior modeling, traffic simulation, and the design of intelligent transportation systems.

1 paper | 0 benchmarks | Time series, Tracking

Dataset for Cell-to-Cell Communications

The dataset records the number of vesicles observed in a T cell that is close to a tumor cell. It stores the number of vesicles over time and for various distances between the T cell and the tumor cell. The dataset is a matrix stored in a MATLAB ".mat" file, where each row represents a distance and each column corresponds to a time index. The distance ranges from 2 to 10 micrometers, and the time varies from 0 to approximately 160 minutes. We also provide the code to generate this dataset and to plot the corresponding curves.

1 paper | 0 benchmarks | Time series
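The row/column layout described for the cell-to-cell communications matrix can be illustrated with a small sketch that attaches a distance and a time to each cell of such a matrix. It assumes the rows are evenly spaced from 2 to 10 micrometers and the columns evenly spaced from 0 to 160 minutes; the actual spacing and the variable name inside the ".mat" file are not given in the description, so the helper names and the toy matrix below are hypothetical.

```python
def axis_values(n, start, stop):
    """n evenly spaced values from start to stop, both inclusive."""
    if n == 1:
        return [float(start)]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def label_matrix(matrix, d_range=(2.0, 10.0), t_range=(0.0, 160.0)):
    """Flatten a rows-by-columns matrix into (distance_um, time_min, count) triples.

    Rows are assumed to index distance and columns to index time, with
    even spacing over the given ranges (an assumption, not a documented fact).
    """
    distances = axis_values(len(matrix), *d_range)
    times = axis_values(len(matrix[0]), *t_range)
    return [
        (d, t, matrix[i][j])
        for i, d in enumerate(distances)
        for j, t in enumerate(times)
    ]


# Toy 2 x 3 matrix standing in for the real data; in practice the matrix
# would be read from the ".mat" file, for example with scipy.io.loadmat.
toy = [[1, 2, 3], [4, 5, 6]]
for row in label_matrix(toy):
    print(row)
```

The long (distance, time, count) form is convenient for plotting one curve per distance, which matches the curves the description mentions.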

Hitchhiking Rides Dataset

This is the dataset described in "Hitchhiking Rides Dataset: Two decades of crowd-sourced records on stochastic traveling" (https://arxiv.org/abs/2506.21946).

1 paper | 0 benchmarks | Tabular, Texts, Time series

LoRaWAN Path Loss Measurements in an Indoor Office Setting including Environmental Factors/Conditions

This dataset was collected during a LoRaWAN measurement campaign in a multi-room indoor office environment at the University of Siegen, Germany. It contains over 1.7 million time-stamped records from 6 LoRaWAN nodes transmitting once per minute to a single gateway. Each record includes environmental parameters, namely temperature, relative humidity, barometric pressure, particulate matter (PM2.5), and carbon dioxide (CO₂), as well as device metadata such as RSSI, SNR, and spreading factor (SF). The dataset also includes the effective signal power (ESP) and the noise (NP) for LoRaWAN propagation analysis. It is designed to support research on indoor wireless propagation, distance estimation, and environment-aware modeling, among other IoT use cases and applications in line with the 6G flagship demands.

1 paper | 0 benchmarks | Time series

7-digit Product-level Supply-Use and Input-Output Tables Using ASI Data

This paper constructs 7-digit product Supply-Use Tables (SUTs) and symmetric Input-Output Tables (IOTs) for the Indian economy using microdata from the Annual Survey of Industries (ASI) for the period 2016-2021. We outline the methodology for generating input flows and reconciling registered and unregistered sector data via NPCMS-NIC concordance. The transition from SUTs to IOTs is explained using the Industry Technology Assumption. We apply this framework to analyse the economic impact—specifically Domestic Value Added (DVA) and employment influenced by production and exports. A case study of India's mobile phone sector reveals significant output growth, import substitution, an increase in exports, a shift in DVA/FVA shares, notable employment growth, with a leaning towards contractual labour, and increased female participation. These tables are valuable for analysing sectoral interdependencies and industrial policy effectiveness in India.

1 paper | 0 benchmarks | Graphs, Images, Tabular, Texts, Time series

ForeDeCk

ForeDeCk is a time series database compiled at the National Technical University of Athens that contains 900,000 continuous time series, built from multiple, diverse and publicly accessible sources. ForeDeCk emphasizes business forecasting applications, including series from relevant domains such as industries, services, tourism, imports & exports, demographics, education, labor & wage, government, households, bonds, stocks, insurances, loans, real estate, transportation, and natural resources & environment.

0 papers | 0 benchmarks | Time series

Bach Chorales

Bach Chorales is a univariate time series dataset consisting of single-line melodies from 100 Bach chorales (originally 4 voices). The melody line can be studied independently of the other voices. The grand challenge is to learn a generative grammar for stylistically valid chorales.

0 papers | 0 benchmarks | Audio, Time series
