19,997 machine learning datasets
Contains two types of reading comprehension data, cloze-style and user-query, along with large-scale training data and human-annotated validation and hidden test sets.
comma 2k19 is a dataset of over 33 hours of commute driving on California's Highway 280. It consists of 2019 segments, each 1 minute long, on a 20 km section of highway between San Jose and San Francisco, California. The dataset was collected using comma EONs, which have sensors similar to those of any modern smartphone, including a road-facing camera, phone GPS, thermometers and a 9-axis IMU.
A new face annotation dataset with balanced distribution between genders and ethnic origins.
Fashion 144K is a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information.
FoodX-251 is a dataset of 251 fine-grained classes with 118k training, 12k validation and 28k test images. Human-verified labels are made available for the training and test images. The classes are fine-grained and visually similar, for example different types of cakes, sandwiches, puddings, soups, and pastas.
The GTA Indoor Motion dataset (GTA-IM) emphasizes human-scene interactions in indoor environments. It consists of HD RGB-D image sequences of 3D human motion from a realistic game engine. The dataset has clean 3D human pose and camera pose annotations, and large diversity in human appearances, indoor environments, camera views, and human activities.
A rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al.).
A large-scale indoor layout dataset containing 35,357 2D floor plans including 252,550 rooms in total.
A large-scale Landmark guided face Parsing dataset (LaPa) for face parsing. It consists of more than 22,000 facial images with abundant variations in expression, pose and occlusion, and each image of LaPa is provided with an 11-category pixel-level label map and 106-point landmarks.
methods2test is a supervised dataset consisting of Test Cases and their corresponding Focal Methods from a set of Java software repositories. methods2test was constructed by parsing the Java projects to obtain classes and methods with their associated metadata. Next, each Test Class was matched to its corresponding Focal Class. Finally, each Test Case within a Test Class was mapped to the related Focal Method to obtain a set of Mapped Test Cases.
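The two-stage mapping described for methods2test (Test Class to Focal Class, then Test Case to Focal Method) can be illustrated with a minimal Python sketch. The listing does not say how the matching is actually done, so the name-based heuristics below (stripping a `Test`/`Tests` affix from the class name and a `test` prefix from the method name) are assumptions made for illustration, and `focal_class_name`, `focal_method_name` and `map_test_cases` are hypothetical helper names, not part of the dataset's tooling.

```python
def focal_class_name(test_class):
    """Guess the focal class by stripping a Test/Tests affix (assumed heuristic)."""
    for suffix in ("Tests", "Test"):
        if test_class.endswith(suffix) and len(test_class) > len(suffix):
            return test_class[:-len(suffix)]
    if test_class.startswith("Test") and len(test_class) > len("Test"):
        return test_class[len("Test"):]
    return None


def focal_method_name(test_case, focal_methods):
    """Guess the focal method by stripping a 'test' prefix (assumed heuristic)."""
    if not test_case.startswith("test") or len(test_case) == len("test"):
        return None
    stem = test_case[len("test"):]
    candidate = stem[0].lower() + stem[1:]
    # Keep the candidate only if the focal class really declares that method.
    return candidate if candidate in focal_methods else None


def map_test_cases(test_classes, focal_classes):
    """Return (test class, test case, focal class, focal method) tuples.

    test_classes:  {test class name: [test case names]}
    focal_classes: {class name: set of method names}
    """
    mapped = []
    for test_class, test_cases in test_classes.items():
        focal_class = focal_class_name(test_class)
        if focal_class not in focal_classes:
            continue  # no matching focal class, so skip the whole test class
        for test_case in test_cases:
            focal_method = focal_method_name(test_case, focal_classes[focal_class])
            if focal_method is not None:
                mapped.append((test_class, test_case, focal_class, focal_method))
    return mapped


# Toy input standing in for metadata parsed from a Java project.
test_classes = {
    "StackTest": ["testPush", "testPopEmpty"],
    "TestQueue": ["testOffer"],
    "HelperTests": ["testRun"],
}
focal_classes = {"Stack": {"push", "pop"}, "Queue": {"offer"}}

for row in map_test_cases(test_classes, focal_classes):
    print(row)
```

Test cases whose names do not resolve to a declared method, such as `testPopEmpty` here, are dropped rather than guessed at, so this sketch favors precision over coverage.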
A dataset consisting of over one million images of physical 3D objects with seven factors of variation, such as object color, shape, size and position.
NCBI Datasets is a valuable resource that simplifies the process of gathering data from various NCBI databases. Whether you’re a researcher, scientist, or bioinformatician, NCBI Datasets provides an efficient way to access sequence information, annotations, and metadata for genes and genomes.
The NewSHead dataset contains 369,940 English stories with 932,571 unique URLs, of which 359,940 stories are for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
ParCorFull is a parallel corpus annotated with full coreference chains, created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face: the translation of coreference across languages. The corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. The corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases.
Perspectrum is a dataset of claims, perspectives and evidence. The initial data was collected from online debate websites and then augmented using search engines to expand and diversify the dataset. Crowd-sourcing was used to filter out noise and ensure high-quality data. The dataset contains 1k claims, accompanied by pools of 10k perspective sentences and 8k evidence paragraphs.
A large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation.
ProtoQA is a question answering dataset for training and evaluating the common sense reasoning capabilities of artificial intelligence systems in prototypical situations. The training set is gathered from an existing set of questions played on the long-running international game show Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.
A large-scale dataset of ~29.5K rain/rain-free image pairs that covers a wide range of natural rain scenes.
Real-World Masked Face Dataset (RMFD) is a large dataset for masked face detection.
A collection that allows researchers to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results.