The Kvasir-VQA dataset is an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations. This dataset is designed to facilitate advanced machine learning tasks in gastrointestinal (GI) diagnostics, including image captioning, Visual Question Answering (VQA) and text-based generation of synthetic medical images.
A tour and travels company wants to predict whether a customer will churn based on the indicators given below. Help build predictive models to save the company money, and perform interesting exploratory data analysis (EDA). The data was created for practice purposes and was also used during a mini hackathon; it is completely free to use.
HELOC: The HELOC dataset from FICO. Each entry in the dataset is a home equity line of credit (HELOC), typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and the outstanding balance on its mortgage). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information in an applicant's credit report to predict whether they will repay their HELOC account within 2 years.
The CTU Relational Learning Repository offers relational database datasets to the machine learning community. It currently hosts 148 SQL databases on a public MySQL server. A searchable meta-database provides key metadata, such as the number of tables, rows, columns, and self-relationships within each database.
Dataset overview: This dataset contains individual-level data from a randomized controlled trial (RCT) conducted in northern Uganda, along with associated satellite imagery. It is designed to investigate how treatment effects may vary across different geographical and contextual settings by leveraging both tabular and image-based variables.
This dataset contains time-stamped user retweet event sequences. The events are categorized into three types: retweets by "small," "medium," and "large" users. Small users have fewer than 120 followers, medium users have fewer than 1,363 followers, and the rest are large users.
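The follower-count thresholds above can be expressed as a small helper; a minimal sketch (the function name is illustrative, not part of the dataset):

```python
def user_category(followers: int) -> str:
    """Map a retweeter's follower count to the dataset's three size classes."""
    if followers < 120:
        return "small"        # small users: fewer than 120 followers
    elif followers < 1363:
        return "medium"       # medium users: fewer than 1,363 followers
    return "large"            # everyone else is a large user
```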
A coronavirus dataset covering 98 countries, constructed from different reliable sources, where each row represents a country and the columns represent geographic, climate, healthcare, economic, and demographic factors that may contribute to accelerating or slowing the spread of COVID-19. The assumptions for the different factors are as follows:
Context: This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, to allow reproduction of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code is available on GitHub.
The softwarised network data zoo (SNDZoo) is an open collection of software networking datasets that aims to streamline and ease machine learning research in the software networking domain. Most of the published datasets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us, or third-party solutions such as Gym. The collection of the presented datasets follows the general VNF benchmarking methodology described in.
This experiment was performed in order to empirically measure the energy use of small, electric Unmanned Aerial Vehicles (UAVs). We autonomously direct a DJI ® Matrice 100 (M100) drone to take off, carry a range of payload weights on a triangular flight pattern, and land. Between flights, we varied the specified parameters across a set of discrete options: payload of 0 g, 250 g, and 500 g; cruise altitude of 25 m, 50 m, 75 m, and 100 m; and cruise speed of 4 m/s, 6 m/s, 8 m/s, 10 m/s, and 12 m/s.
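The full factorial design implied by these discrete options can be enumerated directly; a minimal sketch using Python's itertools (variable names are illustrative):

```python
from itertools import product

payloads_g = [0, 250, 500]            # payload weights in grams
altitudes_m = [25, 50, 75, 100]       # cruise altitudes in metres
speeds_mps = [4, 6, 8, 10, 12]        # cruise speeds in metres per second

# Every combination of the three varied parameters.
flight_configs = list(product(payloads_g, altitudes_m, speeds_mps))
print(len(flight_configs))  # 3 * 4 * 5 = 60 distinct parameter combinations
```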
Titanic dataset overview. The data is divided into two groups:
- Training set (train.csv): used to build machine learning models. It includes the outcome (also called the "ground truth") for each passenger, allowing models to learn to predict survival from "features" such as gender and class. Feature engineering can also be applied to create new features.
- Test set (test.csv): used to evaluate model performance on unseen data. The ground truth is not provided; the task is to predict survival for each passenger in the test set using the trained model.
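The train/predict workflow described above can be sketched with scikit-learn; a minimal, hedged example assuming Kaggle-style column names (`Survived` as the target and `Pclass` among the features — check the CSV headers before use):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def fit_and_predict(train, test, features, target="Survived"):
    """Fit on the labelled training split and predict for the unlabelled test split."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train[features], train[target])
    return model.predict(test[features])

# Typical usage with the files described above:
# train = pd.read_csv("train.csv")
# test = pd.read_csv("test.csv")
# predictions = fit_and_predict(train, test, features=["Pclass"])
```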
We present TNCR, a new table dataset with varying image quality, collected from free open-source websites. The TNCR dataset can be used for table detection in scanned document images and for classifying detected tables into 5 different classes.
This resource is designed to allow for research into Natural Language Generation, in particular with neural data-to-text approaches, although it is not limited to these.
A transaction fee mechanism (TFM) is an essential component of a blockchain protocol. However, a systematic evaluation of the real-world impact of TFMs is still absent. Using rich data from the Ethereum blockchain, mempool, and exchanges, we study the effect of EIP-1559, one of the first deployed TFMs to depart from the traditional first-price auction paradigm. We conduct a rigorous and comprehensive empirical study of its causal effect on blockchain transaction fee dynamics, transaction waiting times, and security. Our results show that EIP-1559 improves the user experience by making fee estimation easier, mitigating intra-block differences in the gas prices paid, and reducing users' waiting times. However, EIP-1559 has only a small effect on gas fee levels and consensus security. In addition, we find that when Ether's price is more volatile, waiting times are significantly higher. We also verify that a larger block size increases the presence of siblings.
The original dataset was provided by Orange telecom in France and contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset covers 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with a time granularity of 30 minutes and a spatial granularity of 6 coarse regions in Paris, France. In other words, it is a multivariate time series dataset.
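Given the stated granularities, the expected shape of the series follows directly; a quick sketch (variable names are illustrative):

```python
days = 72                       # 2020-08-24 to 2020-11-04, as stated above
steps_per_day = 24 * 60 // 30   # 30-minute granularity -> 48 intervals per day
regions = 6                     # coarse spatial regions in Paris

n_steps = days * steps_per_day
print((n_steps, regions))       # (3456, 6): time steps x regions
```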
Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the real severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director of the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset of both severity (BIRADS) and pathology classifications concerning the patient diagnostics. The work and results are published at AVI 2020 (page), a top Human-Computer Interaction (HCI) conference. Results were analyzed and interpreted from our statistical analysis charts. The user tests were conducted in clinical institutions, where clinicians diagnosed several patients in a single-modality vs. multi-modality comparison.
The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model of genetic variation. The first two columns encode the scaled mutation rate (theta) and the scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotide differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), and number of singleton haplotypes (C7).
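The column layout described above can be written down as a small sketch; the column names here are hypothetical labels for illustration (the dataset itself is an unlabelled numeric matrix):

```python
# Hypothetical column names reflecting the description above.
parameters = ["theta", "rho"]               # scaled mutation and recombination rates
summaries = [f"C{i}" for i in range(1, 8)]  # the seven data summaries C1..C7
columns = parameters + summaries

print(len(columns))  # 9 columns, matching the 1,000,000 x 9 matrix
```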
The dataset contains hotel demand and revenue for 8 major tourist destinations in the US (e.g., Los Angeles, Orlando). It includes sales, daily occupancy, demand, and revenue for upper-middle-class hotels.
Measurement data related to the publication "Active TLS Stack Fingerprinting: Characterizing TLS Server Deployments at Scale". It contains weekly TLS and HTTP scan data and the TLS fingerprints for each target.
IHDS (the India Human Development Survey) is a nationally representative, multi-topic panel survey of 41,554 households in 1,503 villages and 971 urban neighborhoods across India.