271 machine learning datasets
The resources for this dataset can be found at https://www.openml.org/d/182
WDC SOTAB is a benchmark that features two annotation tasks: Column Type Annotation and Columns Property Annotation. The goal of the Column Type Annotation (CTA) task is to annotate the columns of a table with 91 Schema.org types, such as telephone, duration, Place, or Organization. The goal of the Columns Property Annotation (CPA) task is to annotate pairs of table columns with one out of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA originating from 74,215 different websites. The tables are split into training, validation, and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting. The tables originate from the Schema.org Table Corpus.
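To illustrate the shape of the CTA task (a column of values in, a Schema.org type out), here is a toy sketch. The type names come from the SOTAB label set, but the column values and the regex heuristic are invented for illustration; they are not the benchmark's baseline method.

```python
import re

def annotate_column(values):
    """Toy CTA: guess a Schema.org type for a column from its values.
    The matching rule below is a made-up heuristic, not SOTAB's method."""
    if all(re.fullmatch(r"\+?[\d\s()-]{7,}", v) for v in values):
        return "telephone"
    return "Place"  # fallback label for this sketch

phone_col = ["+1 (555) 010-7788", "030 1234567"]   # invented column values
city_col = ["Berlin", "Lyon", "Osaka"]

print(annotate_column(phone_col))  # telephone
print(annotate_column(city_col))   # Place
```

A real CTA system would instead be a learned classifier over all 91 types, trained on the 59,548 annotated tables.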
The code to create the dataset is available here. The dataset used in the paper is available on GitHub.
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments.
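HumSet pairs snippet (entry) extraction with multi-label classification per entry. A toy sketch of that data shape follows; the document text, entries, and class names are all invented, and the framework classes are only stand-ins for the humanitarian analysis frameworks mentioned above.

```python
# Each document yields extracted snippets ("entries"), and each entry carries
# one or more framework classes, i.e. multi-label classification.
# All text and labels below are invented for illustration.
entries = [
    {"text": "Flooding displaced 2,000 families in the region.",
     "labels": ["Displacement"]},
    {"text": "Health facilities report shortages of medical supplies.",
     "labels": ["Health", "Logistics"]},
]

# A multi-label target is commonly encoded as a binary vector over all classes.
all_labels = sorted({label for e in entries for label in e["labels"]})
y = [[int(label in e["labels"]) for label in all_labels] for e in entries]

print(all_labels)  # ['Displacement', 'Health', 'Logistics']
print(y)           # [[1, 0, 0], [0, 1, 1]]
```

The binary-matrix encoding is what standard multi-label classifiers consume, which is why the paper frames entry classification this way.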
The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data thus enables a variety of ML studies towards vehicle-to-everything (V2X) communication.
FINDSum is a large-scale dataset for long text and multi-table summarization. It is built on 21,125 annual reports from 3,794 companies and has two subsets for summarizing each company’s results of operations and liquidity.
GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown formats. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, 50,032 of which support IRTs.
A dataset consisting of 46 recipient users and 26,180 tweets. The dataset includes the news feed of the users and 13 features that may influence the relevance of the tweets.
WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. First, we build a Wikipedia table extractor, WTabHTML, and use it to extract tables (in HTML format) from the 2022-03-01 Wikipedia dump. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers from table data, and removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from the table HTML code. This dataset provides a standard benchmark for studying table recognition algorithms in different languages.
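The normalization step above separates table headers from table data in the HTML. A minimal sketch of that split, using only the standard library and assuming a PubTabNet-style `<thead>`/`<tbody>` layout (the sample table is invented, and real WikiTableSet normalization additionally strips CSS/style attributes and records cell bounding boxes):

```python
from html.parser import HTMLParser

class TableSplitter(HTMLParser):
    """Collect header-cell and data-cell texts from one HTML table."""
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_cell = False
        self.header, self.data = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_head = True
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_head = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, text):
        if self.in_cell and text.strip():
            (self.header if self.in_head else self.data).append(text.strip())

html = ("<table><thead><tr><th>City</th><th>Country</th></tr></thead>"
        "<tbody><tr><td>Berlin</td><td>Germany</td></tr></tbody></table>")
p = TableSplitter()
p.feed(html)
print(p.header)  # ['City', 'Country']
print(p.data)    # ['Berlin', 'Germany']
```

The rendered table image plus this kind of structured cell text is exactly the (input, ground truth) pair a table recognition model is trained on.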
The collected dataset consists of multivariate time series (MTS) data from several banking ATMs, along with the annotations that operators made when they performed a maintenance task on any of the machines.
The Traffic Accident Prediction (TAP) data repository offers extensive coverage for 1,000 US cities (TAP-city) and 49 states (TAP-state), providing real-world road structure data that can be easily used for graph-based machine learning methods such as Graph Neural Networks. Additionally, it features multi-dimensional geospatial attributes, including angular and directional features, that are useful for analyzing transportation networks. The TAP repository has the potential to benefit the research community in various applications, including traffic crash prediction, road safety analysis, and traffic crash mitigation. The datasets can be accessed in the TAP-city and TAP-state directories.
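To show why road structure data suits graph-based methods, here is a toy sketch of the representation a Graph Neural Network consumes: intersections as nodes with feature vectors (e.g. the angular and directional attributes mentioned above), road segments as edges. The specific edges and feature values below are invented, not taken from TAP.

```python
from collections import defaultdict

# Invented toy road network: nodes are intersections, edges are road segments.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
node_features = {0: [0.0, 1.0], 1: [0.5, 0.5],   # e.g. directional features
                 2: [1.0, 0.0], 3: [0.2, 0.8]}

# Build an undirected adjacency list -- the core input to message-passing GNNs.
adjacency = defaultdict(list)
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(sorted(adjacency[2]))  # neighbours of intersection 2 -> [0, 1, 3]
```

A GNN layer would then update each intersection's representation by aggregating its neighbours' features over this adjacency structure, which is how road topology enters a crash-prediction model.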
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.
Wyze Rule Recommendation Dataset. It is a large-scale dataset covering 300,000 users. Please cite [1] if you use the dataset and cite [2] if you reference the algorithm.
Kickstarter is a community of more than 10 million people, comprising creatives and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. A project can be almost anything: a device, a game, an app, a film, etc.
This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples that were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.
Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
From the official description:
Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset
This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format. Scans were initially skull-stripped and co-registered, before their tumor segmentation labels were produced by an automated computational method. These segmentation labels were revised and any label misclassifications were manually corrected/approved by expert board-certified neuroradiologists. The final labels were used to extract a rich panel of imaging features, including intensity, volumetric,