19,997 machine learning datasets
19,997 dataset results
The GeoLifeCLEF 2020 dataset is a large-scale remote sensing dataset. More specifically, it consists of 1.9 million species observations from the community science platform iNaturalist, each of which is paired with high-resolution covariates (RGB-IR imagery, land cover, and altitude). The dataset is roughly evenly split between the US and France, and covers over 31k plant and animal species.
QAConv is a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations including business emails, panel discussions, and work channels. Unlike opendomain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or retrieved from a large conversational pool.
The German Breast Cancer Study Group (GBSG2) dataset studies the effects of hormone treatment on recurrence-free survival time. The event of interest is the recurrence of cancer time. This data frame contains the observations of 686 women:
TAU Urban Acoustic Scenes 2019 Mobile development dataset consists of 10-seconds audio segments from 10 acoustic scenes:
DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.
CURE-OR is a large-scale, controlled, and multi-platform object recognition dataset denoted as Challenging Unreal and Real Environments for Object Recognition. In this dataset, there are 1,000,000 images of 100 objects with varying size, color, and texture that are positioned in five different orientations and captured using five devices including a webcam, a DSLR, and three smartphone cameras in real-world (real) and studio (unreal) environments. The controlled challenging conditions include underexposure, overexposure, blur, contrast, dirty lens, image noise, resizing, and loss of color information.
GOO (Gaze-on-Objects) is a dataset for gaze object prediction, where the goal is to predict a bounding box for a person's gazed-at object. GOO is composed of a large set of synthetic images (GOO Synth) supplemented by a smaller subset of real images (GOO-Real) of people looking at objects in a retail environment.
ParaQA is a question answering (QA) dataset with multiple paraphrased responses for single-turn conversation over knowledge graphs (KG). The dataset was created using a semi-automated framework for generating diverse paraphrasing of the answers using techniques such as back-translation. The existing datasets for conversational question answering over KGs (single-turn/multi-turn) focus on question paraphrasing and provide only up to one answer verbalization. However, ParaQA contains 5000 question-answer pairs with a minimum of two and a maximum of eight unique paraphrased responses for each question.
The EDT dataset is designed for corporate event detection and text-based stock prediction (trading strategy) benchmark.
Introduction
The NTIRE 2021 HDR was built for the first challenge on high-dynamic range (HDR) imaging that was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2021. The challenge aims at estimating a HDR image from one or multiple respective low-dynamic range (LDR) observations, which might suffer from under- or over-exposed regions and different sources of noise. The challenge is composed by two tracks: In Track 1 only a single LDR image is provided as input, whereas in Track 2 three differently-exposed LDR images with inter-frame motion are available. In both tracks, the ultimate goal is to achieve the best objective HDR reconstruction in terms of PSNR with respect to a ground-truth image, evaluated both directly and with a canonical tone mapping operation.
ToyADMOS2 is a dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions.
The complete blood count (CBC) dataset contains 360 blood smear images along with their annotation files splitting into Training, Testing, and Validation sets. The training folder contains 300 images with annotations. The testing and validation folder both contain 60 images with annotations. We have done some modifications over the original dataset to prepare this CBC dataset where some of the image annotation files contain very low red blood cells (RBCs) than actual and one annotation file does not include any RBC at all although the cell smear image contains RBCs. So, we clear up all the fallacious files and split the dataset into three parts. Among the 360 smear images, 300 blood cell images with annotations are used as the training set first, and then the rest of the 60 images with annotations are used as the testing set. Due to the shortage of data, a subset of the training set is used to prepare the validation set which contains 60 images with annotations.
a high-level explanation of the dataset characteristics explain motivations and summary of its content potential use cases of the dataset
ZeroWaste is a dataset for automatic waste detection and segmentation. This dataset contains over 1,800 fully segmented video frames collected from a real waste sorting plant along with waste material labels for training and evaluation of the segmentation methods, as well as over 6,000 unlabeled frames that can be further used for semi-supervised and self-supervised learning techniques. ZeroWaste also provides frames of the conveyor belt before and after the sorting process, comprising a novel setup that can be used for weakly-supervised segmentation.
Ruddit is a dataset of English language Reddit comments that has fine-grained, real-valued scores for offensive language detection between -1 (maximally supportive) and 1 (maximally offensive).
Repair AST parse (syntax) errors in Python code
Inolves an annotated a large number of cells, including normal epithelial and myoepithelial breast cells (localized in ducts and lobules), invasive carcinomatous cells, fibroblasts, endothelial cells, adipocytes, macrophages and inflammatory cells (lymphocytes and plasmocytes). In total, our data set consists of 50 images with a total of 4022 annotated cells, the maximum number of cells in one sample is 293 and the minimum number of cells in one sample is 5, with an average of 80 cells per sample and a high standard deviation of 58. The annotation was performed by three experts: an expert pathologist and two trained research fellows. Each sample was annotated by one of the annotators, checked by another one and in case of disagreement, a consensus was established by discussion among the 3 experts.
JRDB-Act is an extension of the JRDB dataset to create a large-scale multi-modal dataset for spatio-temporal action, social group and activity detection.