19,997 machine learning datasets
19,997 dataset results
The Snow100K dataset consists of
SiW provides live and spoof videos from 165 subjects. For each subject, we have 8 live and up to 20 spoof videos, in total 4,478 videos. All videos are in 30 fps, about 15 second length, and 1080P HD resolution. The live videos are collected in four sessions with variations of distance, pose, illumination and expression. The spoof videos are collected with several attacks such as printed paper and replay.
The TOFU dataset serves as a benchmark for evaluating unlearning performance of large language models on realistic tasks. The dataset comprises question-answer pairs based on autobiographies of 200 different authors that do not exist and are completely fictitiously generated by the GPT-4 model. The goal of the task is to unlearn a fine-tuned model on various fractions of the forget set.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
Consists of 1.3 million records of U.S. patent documents along with human written abstractive summaries.
NomBank is an annotation project at New York University that is related to the PropBank project at the University of Colorado. The goal is to mark the sets of arguments that cooccur with nouns in the PropBank Corpus (the Wall Street Journal Corpus of the Penn Treebank), just as PropBank records such information for verbs. As a side effect of the annotation process, the authors are producing a number of other resources including various dictionaries, as well as PropBank style lexical entries called frame files. These resources help the user label the various arguments and adjuncts of the head nouns with roles (sets of argument labels for each sense of each noun). NYU and U of Colorado are making a coordinated effort to insure that, when possible, role definitions are consistent across parts of speech. For example, PropBank's frame file for the verb "decide" was used in the annotation of the noun "decision".
Quoref is a QA dataset which tests the coreferential reasoning capability of reading comprehension systems. In this span-selection benchmark containing 24K questions over 4.7K paragraphs from Wikipedia, a system must resolve hard coreferences before selecting the appropriate span(s) in the paragraphs for answering questions.
The InterHand2.6M dataset is a large-scale real-captured dataset with accurate GT 3D interacting hand poses, used for 3D hand pose estimation The dataset contains 2.6M labeled single and interacting hand frames.
emrQA has 1 million question-logical form and 400,000+ questionanswer evidence pairs.
DrivingStereo contains over 180k images covering a diverse set of driving scenarios, which is hundreds of times larger than the KITTI Stereo dataset. High-quality labels of disparity are produced by a model-guided filtering strategy from multi-frame LiDAR points.
FLoRes is a benchmark dataset for machine translation between English and four low-resource languages, Nepali, Sinhala, Khmer, and Pashto, based on sentences translated from Wikipedia. The FLoRes project has two versions: FLoRes-101 and FLoRes-200.
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.
PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots on data from real-world sources and questions based on crowd-sourced question templates. Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, proposed models for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed size vocabulary or from a bounding box within the image. However, in practice this is an unrealistic assumption because many questions require reasoning and thus have real valued answers which appear neither in a small fixed size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real world plots by introducing PlotQA. Further, 80.76% of the out-of-vocabulary (OOV) questions in PlotQA have answers that are not in a fixed
This repository contains the publicly released dataset, code, and models for the Explanations for CommonsenseQA paper presented at ACL-IJCNLP 2021. Directories data and code inside the root folder contain dataset and code, respectively. The same data and code are also made available through our AIHN collaboration partner institute IIT Delhi. You can download the full paper from here.
Lesion segmentation data includes the original image, paired with the expert manual tracing of the lesion boundaries in the form of a binary mask. The Training Data file is a ZIP file, containing 900 dermoscopic lesion images in JPEG format. All images are named using the scheme ISIC_<image_id>.jpg, where <image_id> is a 7-digit unique identifier. EXIF tags in the images have been removed; any remaining EXIF tags should not be relied upon to provide accurate metadata. The Training Ground Truth file is a ZIP file, containing 900 binary mask images in PNG format. All masks are named using the scheme ISIC_<image_id>_Segmentation.png, where <image_id> matches the corresponding Training Data image for the mask. All mask images will have the exact same dimensions as their corresponding lesion image. Mask images are encoded as single-channel (grayscale) 8-bit PNGs (to provide lossless compression), where each pixel is either:
The Pavia University dataset is a hyperspectral image dataset which gathered by a sensor known as the reflective optics system imaging spectrometer (ROSIS-3) over the city of Pavia, Italy. The image consists of 610×340 pixels with 115 spectral bands. The image is divided into 9 classes with a total of 42,776 labelled samples, including the asphalt, meadows, gravel, trees, metal sheet, bare soil, bitumen, brick, and shadow.
FSS-1000 is a 1000 class dataset for few-shot segmentation. The dataset contains significant number of objects that have never been seen or annotated in previous datasets, such as tiny daily objects, merchandise, cartoon characters, logos, etc.
Letter Recognition Data Set is a handwritten digit dataset. The task is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15.
This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (brightfield vs. fluorescence). The dataset is designed to challenge an algorithm's ability to generalize across these variations.
The Foursquare dataset consists of check-in data for different cities. One subset contains check-ins in NYC and Tokyo collected for about 10 month (from 12 April 2012 to 16 February 2013). It contains 227,428 check-ins in New York city and 573,703 check-ins in Tokyo. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning (represented by fine-grained venue-categories). Another subset contains long-term (about 18 months from April 2012 to September 2013) global-scale check-in data collected from Foursquare. It contains 33,278,683 checkins by 266,909 users on 3,680,126 venues (in 415 cities in 77 countries). Those 415 cities are the most checked 415 cities by Foursquare users in the world, each of which contains at least 10K check-ins.