271 machine learning datasets
Engagement with the government of Taiwan as part of the vTaiwan participatory process which led to the successful regulation of Uber in Taiwan.
This dataset contains rental property listings scraped from Tonaton.com, one of Ghana's leading online classifieds platforms. It provides information on rental prices across various regions in Ghana, along with other property details. The dataset is designed to support analysis, visualization, and modeling of rental prices in the Ghanaian real estate market.
We release the datasets to replicate the results of "Coordinated Reply Attacks in Influence Operations: Characterization and Detection".
WEB-IDS23 is a network intrusion detection dataset that includes over 12 million flows, categorizing 20 attack types across FTP, HTTP/S, SMTP, SSH, and network scanning activities. This dataset is documented in the paper "Technical Report: Generating the WEB-IDS23 Dataset," which provides insights into the generation, structure, and key characteristics of the dataset.
CompMix-IR Dataset Overview:
We introduce a new dataset of clinical report summaries, annotated with structured information across 15 categories. This dataset was created to address the lack of large-scale resources for clinical information extraction (IE). It also promotes the development of methods tailored to clinical data, helping to improve healthcare provision. The dataset contains 60,000 annotated English clinical report summaries, of which over 24,000 examples were translated into German.
NBA Team Statistics, Historical Performance & Betting Odds (2015-2019). This dataset contains team-level box score statistics, historical win percentages, and closing betting odds for NBA games from 2015 to 2019. It supports research in sports analytics, predictive modeling, and betting market efficiency.
Tables of blendshapes for a subset of images from the FER2013 dataset, generated with the MediaPipe library and based on the ARKit face blendshapes. The class of each image is given in a separate column, with the categories Happy, Unknown, and Sad.
Wearanize+ includes overnight sleep data from 130 participants (one night each) using three different wearable devices: Zmax headband, Empatica E4 wristband, and ActivPAL leg patch, alongside full-scale PSG recorded with SomnoScreen Plus and Mentalab Explore Pro. It also includes questionnaires, such as PSQI, MADRE, and PHQ-9, providing information on participants’ sleep, dreams, and overall health. (The link to access the dataset will be added soon).
Precision Liming Soil Datasets (LimeSoDa) is a collection of 31 datasets from a field- and farm-scale soil mapping context. These datasets are "ready-to-use" for modeling purposes, as they include target soil properties and features in a tidy tabular format. Three target soil properties are present in every dataset: (1) soil organic matter (SOM) or soil organic carbon (SOC), (2) pH, and (3) clay content, while the features for modeling are dataset-specific. The primary goal of LimeSoDa is to enable more reliable benchmarking of machine learning methods in digital soil mapping and pedometrics. All the associated materials and data from LimeSoDa can be downloaded from the Zenodo data repository or accessed through the R or Python package implementations. For a more in-depth analysis, we refer to the published paper "LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping" by Schmidinger et al. (2025).
A collection of datasets and benchmarks for large-scale Performance Modeling with LLMs.
This dataset was developed within an analysis of research data generated and managed within the University of Bologna, with respect to the differences and commonalities between disciplines and potential challenges for institutional data support services and infrastructures. We are primarily mapping the type (e.g., image), content (e.g., scan of a manuscript) and format (e.g., .tiff) of managed data, thus sustaining the value of FAIR data as granular resources.
The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured through high-altitude bird's-eye view (BeV) drone footage over Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, this resource represents one of the most extensive aerial traffic datasets publicly available, distinguishing itself through exceptional temporal resolution that captures vehicle movements at 29.97 points per second, enabling unprecedented granularity for advanced urban mobility analysis.
The Songdo Vision dataset provides high-resolution (4K, 3840×2160 pixels) RGB images annotated with categorized axis-aligned bounding boxes (BBs) for vehicle detection from a high-altitude bird’s-eye view (BeV) perspective. Captured over Songdo International Business District, South Korea, this dataset consists of 5,419 annotated video frames, featuring approximately 300,000 vehicle instances categorized into four classes:
Data for a Kaggle competition
This dataset has been curated for a student research project at the Technische Hochschule Ingolstadt with Mi4People and its Soil project (https://de.mi4people.org/soil-quality-evaluation-system).
Enriched data on 1,715 VoxCeleb celebrity speakers, with height gathered from Wikidata
This dataset contains pre-processed versions of datasets introduced in prior works. It also contains new data pertinent to the paper.
IOPS and Latency measurements of a real data storage system
The dataset is a CSV file that contains evaluation scores given by a panel of LLMs to responses produced by other LLMs. The responses concern a forecasting task assigned to multiple LLMs. Each individual forecast is evaluated according to the 9 criteria indicated in the prompt (for details, see https://arxiv.org/abs/2412.09385).
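A score table of this kind can be summarized by averaging the panel's scores for each evaluated model and criterion. The sketch below is illustrative only: the column names (`evaluator`, `evaluated`, `criterion`, `score`) and the sample rows are hypothetical and are not taken from the actual file, whose layout may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real CSV may use different columns and values.
SAMPLE_CSV = """evaluator,evaluated,criterion,score
judge_a,model_x,c1,4
judge_b,model_x,c1,2
judge_a,model_y,c1,5
judge_b,model_y,c1,3
judge_a,model_x,c2,1
judge_b,model_x,c2,3
"""


def mean_scores(csv_text):
    """Average the panel's scores for each (evaluated model, criterion) pair."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["evaluated"], row["criterion"])
        grouped[key].append(float(row["score"]))
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    for (model, criterion), avg in sorted(mean_scores(SAMPLE_CSV).items()):
        print(f"{model} {criterion}: {avg:.2f}")
```

To apply the same idea to the real file, the column names in `mean_scores` would need to be changed to match its actual header.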