19,997 machine learning datasets
12-channel lung sound recordings for each patient, offering a multi-channel analysis opportunity. Covers 5 COPD severities (COPD0, COPD1, COPD2, COPD3, COPD4), with short-term recordings of at least 17 s each.
A tour & travels company wants to predict whether a customer will churn based on the indicators given below. Help build predictive models to save the company money, and perform some fascinating EDA. The data was created for practice purposes and was also used during a mini hackathon; it is completely free to use.
HELOC The HELOC dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information about the applicant in their credit report to predict whether they will repay their HELOC account within 2 years.
Public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations.
Researchy Questions is a set of about 100k Bing queries that users spent the most effort on. After a labor-intensive filtering funnel over billions of queries, these "needles in the haystack" are non-factoid, multi-perspective questions that require many sub-questions and substantial research to answer adequately. These questions are shown to be harder than those in other open-domain QA datasets such as Natural Questions.
This is the dataset for the VCIP 2020 Grand Challenge on NIR Image Colorization. See https://jchenhkg.github.io/projects/NIR2RGB_VCIP_Challenge/ for a detailed description. If you find this dataset helpful, please feel free to cite our paper:

@inproceedings{yang2023cooperative,
  title={Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation},
  author={Yang, Xingxing and Chen, Jie and Yang, Zaifeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={2409--2417},
  year={2023}
}
We introduce our Large Time Lags Location (LTLL) dataset, containing pictures of 25 locations captured over a range of more than 150 years. Specifically, we collected images from several European cities and towns, such as Paris, London, Merelbeke, and Leuven, and from ancient Asian cities, such as Agra in India and Colombo and Kandy in Sri Lanka. We chose thirteen locations for the presence of well-known landmarks, for which it was easy to download old and new pictures from the Web. The remaining twelve locations are in the municipality of Merelbeke, in the Flemish province of East Flanders, Belgium. Historical images of these locations, dating back to the period 1850s-1950s, were provided by the museum in Merelbeke. We downloaded all the corresponding modern images from Flickr, Google Street View, and the Google Images search engine. In total, the dataset contains 225 historical pictures and 275 modern ones.
We present datasets containing urban traffic and rural road scenes recorded using hyperspectral snapshot sensors mounted on a moving car. The novel hyperspectral cameras used can capture whole spectral cubes at up to 15 Hz. This emerging sensor modality enables hyperspectral scene analysis for autonomous driving tasks. To the best of the authors' knowledge, no such dataset has been published so far. The datasets contain synchronized 3-D laser, spectrometer, and hyperspectral data. Dense ground-truth annotations are provided for semantic labels, material, and traversability. The hyperspectral data ranges from visible to near-infrared wavelengths. We describe our recording platform and method and the associated data format, along with a code library for easy data consumption. The datasets are publicly available for download.
HSI-Drive is a hyperspectral image (HSI) dataset created by the Digital Electronics Design Group (GDED) of the University of the Basque Country (UPV/EHU). This database is intended to contribute to research into the use of hyperspectral imaging for the development of advanced driver-assistance systems (ADAS) and autonomous driving systems (ADS). The dataset contains a diverse set of images recorded with a small-size 25-band VNIR snapshot camera mounted on a moving automobile. The recordings were made in different seasons of the year, at different times of day, under different weather conditions, and on different types of roads. The images and videos are classified and tagged accordingly to provide rich and diverse data.
High-quality underwater coral detection dataset for machine learning and computer vision research.
The dataset was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.
The D4LA dataset is a diverse benchmark for document layout analysis (DLA) derived from the RVL-CDIP dataset. It focuses on 12 document types with rich layouts, each represented by approximately 1,000 manually annotated images, while filtering out noisy, handwritten, artistic, or text-scarce images. The dataset defines 27 detailed layout categories, including DocTitle, ListText, Header, Table, Equation, and Footer, among others, catering to real-world applications.
The CTU Relational Learning Repository offers relational database datasets to the machine learning community. It currently hosts 148 SQL databases on a public MySQL server. A searchable meta-database provides key metadata, such as the number of tables, rows, columns, and self-relationships within each database.
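Since the repository exposes its databases over a public MySQL server, access reduces to building a standard connection URL. The sketch below assumes guest-style credentials and a placeholder host name; neither is confirmed by this listing, so substitute the values from the repository's own documentation.

```python
# Sketch: building a connection URL for one of the repository's hosted
# MySQL databases. The credentials and host below are placeholders, not
# values confirmed by this listing.

def mysql_dsn(user: str, password: str, host: str,
              database: str, port: int = 3306) -> str:
    """Build a SQLAlchemy-style MySQL connection URL for one database."""
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"

# Hypothetical example pointing at one of the 148 hosted databases:
dsn = mysql_dsn("guest", "guest_password", "mysql.example.org", "financial")

# With SQLAlchemy and PyMySQL installed, the server could then be queried,
# e.g. to list a database's tables:
#   engine = sqlalchemy.create_engine(dsn)
#   tables = pd.read_sql("SHOW TABLES", engine)
```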
The TriBERT dataset consists of 12,049 training, 2,527 validation, and 2,560 test human-machine collaborative texts. Each text contains both human-written and LLM-generated parts, which can appear in different orders (human → AI, AI → human). Each sample therefore has between 1 and 3 boundaries, indicating the sentences where authorship changes. The texts were created from human-written essays with LLM-generated sections added using ChatGPT.
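The boundary scheme described above can be sketched as a small data structure; note that the field names here are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of how a TriBERT-style sample could be represented: the sentences
# of a collaborative text plus the indices where authorship switches.
# Field names are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass
from typing import List

@dataclass
class CollaborativeText:
    sentences: List[str]   # the text, split into sentences
    boundaries: List[int]  # sentence indices where authorship changes
    first_author: str      # "human" or "ai": who wrote the opening part

    def validate(self) -> bool:
        # Each sample has between 1 and 3 authorship boundaries, and every
        # boundary must fall strictly inside the sentence sequence.
        return (1 <= len(self.boundaries) <= 3
                and all(0 < b < len(self.sentences) for b in self.boundaries))

sample = CollaborativeText(
    sentences=["Human intro.", "Human argument.", "AI continuation.", "AI close."],
    boundaries=[2],          # authorship flips before sentence index 2
    first_author="human",
)
```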
This dataset contains individual-level data from a randomized controlled trial (RCT) conducted in northern Uganda, along with associated satellite imagery. It is designed to investigate how treatment effects may vary across different geographical and contextual settings by leveraging both tabular and image-based variables.
The Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset is a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks and highlight the varying granularity requirements of each task.
This dataset contains time-stamped user retweet event sequences. The events are categorized into 3 types: retweets by "small," "medium," and "large" users. Small users have fewer than 120 followers, medium users have fewer than 1,363 followers, and the rest are large users.
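The follower-count bucketing above amounts to two thresholds; a minimal sketch (the function name is illustrative, not part of the dataset's tooling):

```python
# Minimal sketch of the dataset's three-way user bucketing by follower
# count: small < 120 followers, medium < 1363, everyone else large.
def user_category(followers: int) -> str:
    """Map a user's follower count to the dataset's three event types."""
    if followers < 120:
        return "small"
    if followers < 1363:
        return "medium"
    return "large"
```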
A filtered version of CronQuestions that better demonstrates a model's ability to reason over complex temporal questions.