271 machine learning datasets
The resources for this dataset can be found at https://www.openml.org/d/182
WDC SOTAB is a benchmark that features two annotation tasks: Column Type Annotation and Columns Property Annotation. The goal of the Column Type Annotation (CTA) task is to annotate the columns of a table with 91 Schema.org types, such as telephone, duration, Place, or Organization. The goal of the Columns Property Annotation (CPA) task is to annotate pairs of table columns with one out of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA originating from 74,215 different websites. The tables are split into training, validation, and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting. The tables originate from the Schema.org Table Corpus.
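To illustrate the shape of the CTA task (a column of values in, a Schema.org type out), here is a toy sketch. The type names come from the SOTAB label set, but the column values and the regex heuristic are invented for illustration; they are not the benchmark's baseline method.

```python
import re

def annotate_column(values):
    """Toy CTA: guess a Schema.org type for a column from its values.
    The matching rule below is a made-up heuristic, not SOTAB's method."""
    if all(re.fullmatch(r"\+?[\d\s()-]{7,}", v) for v in values):
        return "telephone"
    return "Place"  # fallback label for this sketch

phone_col = ["+1 (555) 010-7788", "030 1234567"]   # invented column values
city_col = ["Berlin", "Lyon", "Osaka"]

print(annotate_column(phone_col))  # telephone
print(annotate_column(city_col))   # Place
```

A real CTA system would instead be a learned classifier over all 91 types, trained on the 59,548 annotated tables.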
The code to create the dataset is available here. The dataset used in the paper is available on GitHub.
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments.
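HumSet pairs snippet (entry) extraction with multi-label classification per entry. A toy sketch of that data shape follows; the document text, entries, and class names are all invented, and the framework classes are only stand-ins for the humanitarian analysis frameworks mentioned above.

```python
# Each document yields extracted snippets ("entries"), and each entry carries
# one or more framework classes, i.e. multi-label classification.
# All text and labels below are invented for illustration.
entries = [
    {"text": "Flooding displaced 2,000 families in the region.",
     "labels": ["Displacement"]},
    {"text": "Health facilities report shortages of medical supplies.",
     "labels": ["Health", "Logistics"]},
]

# A multi-label target is commonly encoded as a binary vector over all classes.
all_labels = sorted({label for e in entries for label in e["labels"]})
y = [[int(label in e["labels"]) for label in all_labels] for e in entries]

print(all_labels)  # ['Displacement', 'Health', 'Logistics']
print(y)           # [[1, 0, 0], [0, 1, 1]]
```

The binary-matrix encoding is what standard multi-label classifiers consume, which is why the paper frames entry classification this way.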
The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data thus enables a variety of ML studies towards vehicle-to-everything (V2X) communication.
FINDSum is a large-scale dataset for long text and multi-table summarization. It is built on 21,125 annual reports from 3,794 companies and has two subsets for summarizing each company’s results of operations and liquidity.
GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown formats. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, 50,032 of which support IRTs.
A dataset consisting of 46 recipient users and 26,180 tweets. The dataset includes the news feed of the users and 13 features that may influence the relevance of the tweets.
WikiTableSet is a large publicly available image-based table recognition dataset in three languages built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. First, we build a Wikipedia table extractor, WTabHTML, and use it to extract tables (in HTML format) from the 2022-03-01 Wikipedia dump. In this study, we select Wikipedia tables from three representative languages, i.e., English, Japanese, and French; however, the dataset could be extended to around 300 languages with 17M tables using our table extractor. Second, we normalize the HTML tables following the PubTabNet format (separating table headers from table data, and removing CSS and style tags). Finally, we use Chrome and Selenium to render table images from the table HTML code. This dataset provides a standard benchmark for studying table recognition algorithms in different languages.
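The normalization step above separates table headers from table data in the HTML. A minimal sketch of that split, using only the standard library and assuming a PubTabNet-style `<thead>`/`<tbody>` layout (the sample table is invented, and real WikiTableSet normalization additionally strips CSS/style attributes and records cell bounding boxes):

```python
from html.parser import HTMLParser

class TableSplitter(HTMLParser):
    """Collect header-cell and data-cell texts from one HTML table."""
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_cell = False
        self.header, self.data = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_head = True
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_head = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, text):
        if self.in_cell and text.strip():
            (self.header if self.in_head else self.data).append(text.strip())

html = ("<table><thead><tr><th>City</th><th>Country</th></tr></thead>"
        "<tbody><tr><td>Berlin</td><td>Germany</td></tr></tbody></table>")
p = TableSplitter()
p.feed(html)
print(p.header)  # ['City', 'Country']
print(p.data)    # ['Berlin', 'Germany']
```

The rendered table image plus this kind of structured cell text is exactly the (input, ground truth) pair a table recognition model is trained on.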
The collected dataset consists of multivariate time series (MTS) data from several banking ATMs, along with the annotations that operators made when they performed a maintenance task on any of the machines.
The Traffic Accident Prediction (TAP) data repository offers extensive coverage for 1,000 US cities (TAP-city) and 49 states (TAP-state), providing real-world road structure data that can be easily used for graph-based machine learning methods such as Graph Neural Networks. Additionally, it features multi-dimensional geospatial attributes, including angular and directional features, that are useful for analyzing transportation networks. The TAP repository has the potential to benefit the research community in various applications, including traffic crash prediction, road safety analysis, and traffic crash mitigation. The datasets can be accessed in the TAP-city and TAP-state directories.
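To show why road structure data suits graph-based methods, here is a toy sketch of the representation a Graph Neural Network consumes: intersections as nodes with feature vectors (e.g. the angular and directional attributes mentioned above), road segments as edges. The specific edges and feature values below are invented, not taken from TAP.

```python
from collections import defaultdict

# Invented toy road network: nodes are intersections, edges are road segments.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
node_features = {0: [0.0, 1.0], 1: [0.5, 0.5],   # e.g. directional features
                 2: [1.0, 0.0], 3: [0.2, 0.8]}

# Build an undirected adjacency list -- the core input to message-passing GNNs.
adjacency = defaultdict(list)
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(sorted(adjacency[2]))  # neighbours of intersection 2 -> [0, 1, 3]
```

A GNN layer would then update each intersection's representation by aggregating its neighbours' features over this adjacency structure, which is how road topology enters a crash-prediction model.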
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.
Wyze Rule Recommendation Dataset. It is a large-scale dataset covering 300,000 users. Please cite [1] if you use the dataset and cite [2] if you reference the algorithm.
Kickstarter is a community of more than 10 million people, comprising creatives and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. A project can be almost anything: a device, a game, an app, a film, etc.
This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples that were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.
Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
From the official description:
Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset
This collection comprises multi-parametric magnetic resonance imaging (mpMRI) scans for de novo Glioblastoma (GBM) patients from the University of Pennsylvania Health System, coupled with patient demographics, clinical outcome (e.g., overall survival, genomic information, tumor progression), as well as computer-aided and manually-corrected segmentation labels of multiple histologically distinct tumor sub-regions, computer-aided and manually-corrected segmentations of the whole brain, a rich panel of radiomic features along with their corresponding co-registered mpMRI volumes in NIfTI format. Scans were initially skull-stripped and co-registered, before their tumor segmentation labels were produced by an automated computational method. These segmentation labels were revised and any label misclassifications were manually corrected/approved by expert board-certified neuroradiologists. The final labels were used to extract a rich panel of imaging features, including intensity, volumetric,