Datasets

271 machine learning datasets

Polis export data: vTaiwan UberX

Engagement with the government of Taiwan as part of the vTaiwan participatory process which led to the successful regulation of Uber in Taiwan.

1 paper | 0 benchmarks | Tabular

Ghana house rental dataset

This dataset contains rental property listings scraped from Tonaton.com, one of Ghana's leading online classifieds platforms. It provides rental prices across various regions in Ghana, along with other property details, and is designed to support analysis, visualization, and modeling of rental prices in the Ghanaian real estate market.

1 paper | 0 benchmarks | Tabular

io-coordinated-replies

We release the datasets to replicate the results of "Coordinated Reply Attacks in Influence Operations: Characterization and Detection".

1 paper | 0 benchmarks | Tabular

WEB-IDS23 Dataset

WEB-IDS23 is a network intrusion detection dataset that includes over 12 million flows, categorizing 20 attack types across FTP, HTTP/S, SMTP, SSH, and network scanning activities. This dataset is documented in the paper "Technical Report: Generating the WEB-IDS23 Dataset," which provides insights into the generation, structure, and key characteristics of the dataset.

1 paper | 0 benchmarks | Tabular

CompMix-IR

CompMix-IR Dataset Overview:

1 paper | 0 benchmarks | Graphs, Tabular, Texts

ELMTEX Dataset (ELMTEX Dataset: Fine-Tuning Large Language Models for Structured Clinical Information Extraction)

We introduced a new dataset of clinical report summaries, annotated with structured information across 15 categories. This dataset was created to address the lack of large-scale resources for clinical information extraction (IE). It also promotes the development of methods tailored to clinical data, helping to improve healthcare provision. The dataset contains 60,000 annotated English clinical report summaries, from which we translated over 24,000 examples into German.

1 paper | 0 benchmarks | Tabular

NBA_Box_Scores_Odds (NBA Team-Level Box Score Statistics (2015-2019), Historical Win Percentages (2014-2018) and Betting Odds (2018/2019))

This dataset contains team-level box score statistics, historical win percentages, and closing betting odds for NBA games from 2015 to 2019. It supports research in sports analytics, predictive modeling, and betting market efficiency.

1 paper | 0 benchmarks | Tabular

FER2013 Blendshapes (FER2013 blendshapes dataset example (Partial))

Tables of blendshapes for a subset of the images in the FER2013 dataset, generated with the MediaPipe library and based on the ARKit face blendshapes. The class of each image is given in a separate column, with the categories Happy, Unknown, and Sad.

1 paper | 0 benchmarks | 3d meshes, Images, Tabular, Tracking

Wearanize+ Dataset (v1.0)

Wearanize+ includes overnight sleep data from 130 participants (one night each) using three different wearable devices: Zmax headband, Empatica E4 wristband, and ActivPAL leg patch, alongside full-scale PSG recorded with SomnoScreen Plus and Mentalab Explore Pro. It also includes questionnaires, such as PSQI, MADRE, and PHQ-9, providing information on participants’ sleep, dreams, and overall health. (The link to access the dataset will be added soon).

1 paper | 0 benchmarks | Biomedical, Tabular, Time series

LimeSoDa (Precision Liming Soil Datasets)

Precision Liming Soil Datasets (LimeSoDa) is a collection of 31 datasets from a field- and farm-scale soil mapping context. These datasets are "ready-to-use" for modeling purposes, as they include target soil properties and features in a tidy tabular format. Three target soil properties are present in every dataset: (1) soil organic matter (SOM) or soil organic carbon (SOC), (2) pH, and (3) clay content, while the features for modeling are dataset-specific. The primary goal of LimeSoDa is to enable more reliable benchmarking of machine learning methods in digital soil mapping and pedometrics. All associated materials and data can be downloaded from the Zenodo data repository or accessed through the R or Python package implementations. For a more in-depth analysis, see the published paper "LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping" by Schmidinger et al. (2025).

1 paper | 0 benchmarks | Tabular

opencl-llmperf

A collection of datasets and benchmarks for large-scale Performance Modeling with LLMs.

1 paper | 0 benchmarks | Tabular, Texts

Mapping Research Data at the University of Bologna (Mapping Research Data at the University of Bologna: Dataset)

This dataset was developed within an analysis of research data generated and managed within the University of Bologna, with respect to the differences and commonalities between disciplines and potential challenges for institutional data support services and infrastructures. We are primarily mapping the type (e.g., image), content (e.g., scan of a manuscript) and format (e.g., .tiff) of managed data, thus sustaining the value of FAIR data as granular resources.

1 paper | 0 benchmarks | Tabular

Songdo Traffic (Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a Large-Scale Study in a Smart City)

The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured through high-altitude bird's-eye view (BeV) drone footage over Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, this resource represents one of the most extensive aerial traffic datasets publicly available, distinguishing itself through exceptional temporal resolution that captures vehicle movements at 29.97 points per second, enabling unprecedented granularity for advanced urban mobility analysis.

1 paper | 0 benchmarks | Images, Tabular, Time series, Tracking, Videos

Songdo Vision (Songdo Vision: Vehicle Annotations from High-Altitude BeV Drone Imagery in a Smart City)

The Songdo Vision dataset provides high-resolution (4K, 3840×2160 pixels) RGB images annotated with categorized axis-aligned bounding boxes (BBs) for vehicle detection from a high-altitude bird’s-eye view (BeV) perspective. Captured over Songdo International Business District, South Korea, this dataset consists of 5,419 annotated video frames, featuring approximately 300,000 vehicle instances categorized into four classes:

1 paper | 20 benchmarks | Images, Tabular

GMSC (Give Me Some Credit)

Data for a Kaggle competition

1 paper | 0 benchmarks | Tabular

AgroLens (AgroLens Soil Prediction Dataset)

This dataset has been curated for a student research project at the Technische Hochschule Ingolstadt with Mi4People and its Soil project (https://de.mi4people.org/soil-quality-evaluation-system).

1 paper | 0 benchmarks | Tabular

VoxcelebHeight (Voxceleb Height Dataset)

Voxceleb speaker data for 1,715 celebrities, enriched with height information gathered from Wikidata.

1 paper | 0 benchmarks | Tabular

MERGE SPCS

This dataset contains pre-processed versions of datasets introduced in prior works. Additionally, it also contains new data that are pertinent to the paper.

1 paper | 0 benchmarks | Biology, Biomedical, Images, Medical, Tables, Tabular

Data Storage System Performance

IOPS and Latency measurements of a real data storage system

1 paper | 0 benchmarks | Tables, Tabular

LLM evaluation scores (Scores given by LLM according to preassigned score)

The dataset is a CSV file that contains evaluation scores given by a panel of LLMs to responses produced by other LLMs. The responses concern a forecasting task assigned to multiple LLMs. Each forecast is evaluated according to 9 criteria indicated in the prompt (for details, see https://arxiv.org/abs/2412.09385).

1 paper | 0 benchmarks | Tabular
