Datasets

298 machine learning datasets

298 dataset results

HOMER (Household Object Movements from Everyday Routines)

The Household Object Movements from Everyday Routines (HOMER) dataset is composed of routine behaviors for five households, spanning 50 days for the train split and 10 days for test split. The households are based on an identical apartment setting with four rooms and 108 objects and 33 atomic actions such as find, grab, etc.

4 papers0 benchmarks3D, Time series

The tourism forecasting competition

The data we use include 366 monthly series, 427 quarterly series and 518 yearly series. They were supplied by both tourism bodies (such as Tourism Australia, the Hong Kong Tourism Board and Tourism New Zealand) and various academics, who had used them in previous tourism forecasting studies (please refer to the acknowledgements and details of the data sources and availability).

4 papers0 benchmarksTime series

IBM Transactions for Anti Money Laundering

Money laundering is a multi-billion dollar issue. Detection of laundering is very difficult. Most automated algorithms have a high false positive rate: legitimate transactions incorrectly flagged as laundering. The converse is also a major problem -- false negatives, i.e. undetected laundering transactions. Naturally, criminals work hard to cover their tracks.

4 papers0 benchmarksGraphs, Time series

KU-HAR

Human Activity Recognition (HAR) refers to the capacity of machines to perceive human actions. This dataset contains information on 18 different activities collected from 90 participants (75 male and 15 female) using smartphone sensors (Accelerometer and Gyroscope). It has 1945 raw activity samples collected directly from the participants, and 20750 subsamples extracted from them. The activities are:

4 papers0 benchmarksTabular, Time series

ECG-Image-Database (Digitization and Classification of ECG Images: The George B. Moody PhysioNet Challenge 2024)

The George B. Moody PhysioNet Challenges are annual competitions that invite participants to develop automated approaches for addressing important physiological and clinical problems. The 2024 Challenge invites teams to develop algorithms for digitizing and classifying electrocardiograms (ECGs) captured from images or paper printouts. Despite the recent advances in digital ECG devices, physical or paper ECGs remain common, especially in the Global South. These physical ECGs document the history and diversity of cardiovascular diseases (CVDs), and algorithms that can digitize and classify these images have the potential to improve our understanding and treatment of CVDs, especially for underrepresented and underserved populations.

4 papers1 benchmarksTime series

NinaPro DB2 (DB2 - 40 Intact Subjects - Delsys Trigno electrodes)

The second Ninapro database includes 40 intact subjects and it is thoroughly described in the paper: "Manfredo Atzori, Arjan Gijsberts, Claudio Castellini, Barbara Caputo, Anne-Gabrielle Mittaz Hager, Simone Elsig, Giorgio Giatsidis, Franco Bassetto & Henning Müller. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Scientific Data, 2014" (http://www.nature.com/articles/sdata201453). Please, cite this paper for any work related to the Ninapro database. Please, use also the paper by Gijsberts et al., 2014 (http://publications.hevs.ch/index.php/publications/show/1629) for more information about the database.

3 papers0 benchmarksBiomedical, Medical, Time series

MIT-BIH AFDB (MIT-BIH Atrial Fibrilation Database)

This database includes 25 long-term ECG recordings of human subjects with atrial fibrillation (mostly paroxysmal).

3 papers0 benchmarksMedical, Time series

PWDB (Pulse Wave Database)

Overview This database of simulated arterial pulse waves is designed to be representative of a sample of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects, aged from 25-75 years old (in 10 year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) which are representative of healthy subjects at each age group. It also contains 728 further virtual subjects at each age group, in which each of the cardiovascular properties are varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and the performance of pulse wave analysis algorithms.

3 papers0 benchmarksBiology, Biomedical, Medical, Time series

SKAB (Skoltech Anomaly Benchmark)

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ evaluation. Each dataset represents a multivariate time series collected from the sensors installed on the testbed. All instances are labeled for evaluating the results of solving outlier detection and changepoint detection problems.

3 papers4 benchmarksTables, Time series

BRUSH (Brown University Stylus Handwriting)

The BRUSH dataset (BRown University Stylus Handwriting) contains 27,649 online handwriting samples from a total of 170 writers. Every sequence is labeled with intended characters such that dataset users can identify to which character a point in a sequence corresponds. The dataset was introduced in the paper "Generating Handwriting via Decoupled Style Descriptors" by Atsunobu Kotani, Stefanie Tellex, James Tompkin from Brown University, presented at European Conference on Computer Vision (ECCV) 2020.

3 papers0 benchmarksTime series

TrajAir: A General Aviation Trajectory Dataset

This dataset contains aircraft trajectories in an untowered terminal airspace collected over 8 months surrounding the Pittsburgh-Butler Regional Airport [ICAO:KBTP], a single runway GA airport, 10 miles North of the city of Pittsburgh, Pennsylvania. The trajectory data is recorded using an on-site setup that includes an ADS-B receiver. The trajectory data provided spans days from 18 Sept 2020 till 23 Apr 2021 and includes a total of 111 days of data discounting downtime, repairs, and bad weather days with no traffic. Data is collected starting at 1:00 AM local time to 11:00 PM local time. The dataset uses an Automatic Dependent Surveillance-Broadcast (ADS-B) receiver placed within the airport premises to capture the trajectory data. The receiver uses both the 1090 MHz and 978 MHz frequencies to listen to these broadcasts. The ADS-B uses satellite navigation to produce accurate location and timestamp for the targets which is recorded on-site using our custom setup. Weather data during t

3 papers2 benchmarksTabular, Time series

Spectrum Challange 2 Dataset

The dataset is approved for public release, distribution unlimited.

3 papers0 benchmarksTime series

HiRID

HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (ICU), an interdisciplinary 60-bed unit admitting >6,500 patients per year. The ICU offers the full range of modern interdisciplinary intensive care medicine for adult patients. The dataset was developed in cooperation between the Swiss Federal Institute of Technology (ETH) Zürich, Switzerland and the ICU.

3 papers11 benchmarksMedical, Time series

ExtMarker (3D motion of chest external markers)

Three-dimensional position of external markers placed on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The markers move because of the respiratory motion, and their position is sampled at approximately 10Hz. Markers are metallic objects used during external beam radiotherapy to track and predict the motion of tumors due to breathing for accurate dose delivery.

3 papers15 benchmarksMedical, Time series

METR-LA Point Missing

The original dataset from Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting contains traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated in 5 minutes intervals over four months between March 2012 and June 2012.

3 papers1 benchmarksTime series

PEMS-BAY Point Missing

The original dataset from Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting contains 6 months of traffic readings from 01/01/2017 to 05/31/2017 collected every 5 minutes by 325 traffic sensors in San Francisco Bay Area. The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS).

3 papers1 benchmarksTime series

Apnea-ECG (PhysioNet Apnea-ECG Database)

The data consist of 70 records, divided into a learning set of 35 records (a01 through a20, b01 through b05, and c01 through c10), and a test set of 35 records (x01 through x35), all of which may be downloaded from this page. Recordings vary in length from slightly less than 7 hours to nearly 10 hours each. Each recording includes a continuous digitized ECG signal, a set of apnea annotations (derived by human experts on the basis of simultaneously recorded respiration and related signals), and a set of machine-generated QRS annotations (in which all beats regardless of type have been labeled normal). In addition, eight recordings (a01 through a04, b01, and c01 through c03) are accompanied by four additional signals (Resp C and Resp A, chest and abdominal respiratory effort signals obtained using inductance plethysmography; Resp N, oronasal airflow measured using nasal thermistors; and SpO2, oxygen saturation).

3 papers9 benchmarksMedical, Time series

Extreme Events > Natural Disasters > Hurricane (Tourism > Finance > Sales Revenue)

A new spatio-temporal benchmark dataset (Hurricane), is suited for forecasting during extreme events and anomalies. The dataset is provided through the Florida Department of Revenue which provides the monthly sales revenue (2003-2020) for the tourism industry for all 67 counties of Florida which are prone to annual hurricanes. Furthermore, we aligned and joined the raw time series with the history of hurricane categories (i.e., event intensities) based on time for each county. Note that the hurricane category indicates the maximum sustained wind speed which can result in catastrophic damages as this number goes up (Category 1-6).

3 papers2 benchmarksTabular, Time series

AugMod (AugMod: pythagore-mod-reco)

Context A radio signal consists in two channels, channel I (for 'In phase') and channel Q (for 'Quadrature') and can be assimilated as a stream of complex numbers. It may convey information by coding it as a sequence of symbols sampled from a finite set of complex numbers called a "modulation". There exist several standard modulations such as (non exhaustive list): BPSK, QAM, QPSK of order N, PSK of order N…

3 papers0 benchmarksTime series

Dhaka Stock Exchange Historical Data

The dataset contains historical technical data of Dhaka Stock Exchange (DSE). The data was collected from different sources found in the internet where the data was publicly available. The data available here are used for information and research purposes and though to the best of our knowledge, it does not contain any mistakes, there might still be some mistakes. It is not encourages to use this dataset for portfolio management purposes and use this dataset out of your own interest. The contributors do not hold any liability if it is used for any purposes.

3 papers0 benchmarksTabular, Time series

PreviousPage 5 of 15Next