19,997 machine learning datasets
12-channel lung sound recordings for each patient, offering a multi-channel analysis opportunity. Covers 5 COPD severities (COPD0, COPD1, COPD2, COPD3, COPD4), with short-term recordings of at least 17 s each.
A tour & travels company wants to predict whether a customer will churn based on the indicators given below. Help build predictive models to save the company money, and perform some fascinating EDA. The data was created for practice purposes and was also used during a mini hackathon; it is completely free to use.
HELOC The HELOC dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information about the applicant in their credit report to predict whether they will repay their HELOC account within 2 years.
Public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations.
Researchy Questions is a set of about 100k Bing queries that users spent the most effort on. After a labor-intensive filtering funnel over billions of queries, these "needles in the haystack" are non-factoid, multi-perspective questions that require many sub-questions and substantial research to answer adequately. These questions are shown to be harder than those in other open-domain QA datasets such as Natural Questions.
This is the dataset for the VCIP 2020 Grand Challenge on NIR Image Colorization. See https://jchenhkg.github.io/projects/NIR2RGB_VCIP_Challenge/ for a detailed description. If you find this dataset helpful, please feel free to cite our paper:

@inproceedings{yang2023cooperative,
  title={Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation},
  author={Yang, Xingxing and Chen, Jie and Yang, Zaifeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={2409--2417},
  year={2023}
}
We introduce our Large Time Lags Location (LTLL) dataset, containing pictures of 25 locations captured over a range of more than 150 years. Specifically, we collected images from several European cities and towns, such as Paris, London, Merelbeke, and Leuven, and from ancient Asian cities, such as Agra in India and Colombo and Kandy in Sri Lanka. We chose thirteen locations for the presence of well-known landmarks, for which it was easy to download old and new pictures from the Web. The remaining twelve locations are in the municipality of Merelbeke, in the Flemish province of East Flanders, Belgium. Historical images of these locations, dating back to the period 1850s-1950s, were provided by the museum in Merelbeke. We downloaded all the corresponding modern images from Flickr, Google Street View, and the Google Images search engine. In total, the dataset contains 225 historical pictures and 275 modern ones.
We present datasets containing urban traffic and rural road scenes recorded using hyperspectral snapshot sensors mounted on a moving car. The novel hyperspectral cameras used can capture whole spectral cubes at up to 15 Hz. This emerging sensor modality enables hyperspectral scene analysis for autonomous driving tasks. To the best of the authors' knowledge, no such dataset has been published so far. The datasets contain synchronized 3-D laser, spectrometer, and hyperspectral data. Dense ground-truth annotations are provided for semantic labels, material, and traversability. The hyperspectral data ranges from visible to near-infrared wavelengths. We describe our recording platform and method and the associated data format, along with a code library for easy data consumption. The datasets are publicly available for download.
HSI-Drive is a hyperspectral image (HSI) dataset created by the Digital Electronics Design Group (GDED) of the University of the Basque Country (UPV/EHU). This database is intended to contribute to research into the use of hyperspectral imaging for the development of advanced driver-assistance systems (ADAS) and autonomous driving systems (ADS). The dataset contains a diverse set of images recorded with a small-size 25-band VNIR snapshot camera mounted on a moving automobile. The recordings were made in different seasons of the year, at different times of day, under different weather conditions, and on different types of roads. The images and videos are classified and tagged accordingly to provide rich and diverse data.
High-quality underwater coral detection dataset for machine learning and computer vision research.
The dataset was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.
The D4LA dataset is a diverse benchmark for document layout analysis (DLA) derived from the RVL-CDIP dataset. It focuses on 12 document types with rich layouts, each represented by approximately 1,000 manually annotated images, while filtering out noisy, handwritten, artistic, or text-scarce images. The dataset defines 27 detailed layout categories, including DocTitle, ListText, Header, Table, Equation, and Footer, among others, catering to real-world applications.
The CTU Relational Learning Repository offers relational database datasets to the machine learning community. It currently hosts 148 SQL databases on a public MySQL server. A searchable meta-database provides key metadata, such as the number of tables, rows, columns, and self-relationships within each database.
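Since the repository exposes its databases over a public MySQL server, access reduces to building a standard connection URL. The sketch below assumes guest-style credentials and a placeholder host name; neither is confirmed by this listing, so substitute the values from the repository's own documentation.

```python
# Sketch: building a connection URL for one of the repository's hosted
# MySQL databases. The credentials and host below are placeholders, not
# values confirmed by this listing.

def mysql_dsn(user: str, password: str, host: str,
              database: str, port: int = 3306) -> str:
    """Build a SQLAlchemy-style MySQL connection URL for one database."""
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"

# Hypothetical example pointing at one of the 148 hosted databases:
dsn = mysql_dsn("guest", "guest_password", "mysql.example.org", "financial")

# With SQLAlchemy and PyMySQL installed, the server could then be queried,
# e.g. to list a database's tables:
#   engine = sqlalchemy.create_engine(dsn)
#   tables = pd.read_sql("SHOW TABLES", engine)
```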
The TriBERT dataset consists of 12,049 training, 2,527 validation, and 2,560 test human-machine collaborative texts. Each text contains both human-written and LLM-generated parts, which can appear in different orders (human → AI, AI → human). Each sample therefore has between 1 and 3 boundaries, indicating the sentences where authorship changes. The texts were created from human-written essays with LLM-generated sections added using ChatGPT.
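The boundary scheme described above can be sketched as a small data structure; note that the field names here are illustrative assumptions, not the dataset's actual schema.

```python
# Sketch of how a TriBERT-style sample could be represented: the sentences
# of a collaborative text plus the indices where authorship switches.
# Field names are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass
from typing import List

@dataclass
class CollaborativeText:
    sentences: List[str]   # the text, split into sentences
    boundaries: List[int]  # sentence indices where authorship changes
    first_author: str      # "human" or "ai": who wrote the opening part

    def validate(self) -> bool:
        # Each sample has between 1 and 3 authorship boundaries, and every
        # boundary must fall strictly inside the sentence sequence.
        return (1 <= len(self.boundaries) <= 3
                and all(0 < b < len(self.sentences) for b in self.boundaries))

sample = CollaborativeText(
    sentences=["Human intro.", "Human argument.", "AI continuation.", "AI close."],
    boundaries=[2],          # authorship flips before sentence index 2
    first_author="human",
)
```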
This dataset contains individual-level data from a randomized controlled trial (RCT) conducted in northern Uganda, along with associated satellite imagery. It is designed to investigate how treatment effects may vary across different geographical and contextual settings by leveraging both tabular and image-based variables.
The Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset is a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks and highlight the varying granularity requirements of each task.
This dataset contains time-stamped user retweet event sequences. The events are categorized into 3 types: retweets by "small," "medium," and "large" users. Small users have fewer than 120 followers, medium users have fewer than 1,363 followers, and the rest are large users.
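The follower-count bucketing above amounts to two thresholds; a minimal sketch (the function name is illustrative, not part of the dataset's tooling):

```python
# Minimal sketch of the dataset's three-way user bucketing by follower
# count: small < 120 followers, medium < 1363, everyone else large.
def user_category(followers: int) -> str:
    """Map a user's follower count to the dataset's three event types."""
    if followers < 120:
        return "small"
    if followers < 1363:
        return "medium"
    return "large"
```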
A filtered version of CronQuestions that better demonstrates a model's ability to reason over complex temporal questions.