Datasets

19,997 machine learning datasets

CMRC 2017 (Chinese Machine Reading Comprehension 2017)

Contains two task types, cloze-style reading comprehension and user-query reading comprehension, accompanied by large-scale training data as well as human-annotated validation and hidden test sets.

11 papers | 0 benchmarks

comma 2k19

comma 2k19 is a dataset of over 33 hours of commute driving on California's Highway 280. It comprises 2019 segments, each 1 minute long, recorded on a 20 km section of highway between San Jose and San Francisco. The dataset was collected using comma EONs, which have sensors similar to those of any modern smartphone, including a road-facing camera, phone GPS, thermometers and a 9-axis IMU.

11 papers | 0 benchmarks

DiveFace

A new face annotation dataset with a balanced distribution across genders and ethnic origins.

11 papers | 6 benchmarks

Fashion 144K

Fashion 144K is a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information.

11 papers | 0 benchmarks

FoodX-251

FoodX-251 is a dataset of 251 fine-grained classes with 118k training, 12k validation and 28k test images. Human-verified labels are made available for the training and test images. The classes are fine-grained and visually similar, for example, different types of cakes, sandwiches, puddings, soups, and pastas.

11 papers | 2 benchmarks | Images

GTA-IM Dataset (GTA Indoor Motion)

The GTA Indoor Motion dataset (GTA-IM) emphasizes human-scene interactions in indoor environments. It consists of HD RGB-D image sequences of 3D human motion from a realistic game engine. The dataset has clean 3D human pose and camera pose annotations, and large diversity in human appearances, indoor environments, camera views, and human activities.

11 papers | 9 benchmarks | Images, Videos

House3D Environment

A rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al.).

11 papers | 0 benchmarks | 3D, Environment

HouseExpo

A large-scale indoor layout dataset containing 35,357 2D floor plans including 252,550 rooms in total.

11 papers | 0 benchmarks

LaPa

A large-scale Landmark guided face Parsing dataset (LaPa) for face parsing. It consists of more than 22,000 facial images with abundant variations in expression, pose and occlusion, and each image is provided with an 11-category pixel-level label map and 106-point landmarks.

11 papers | 2 benchmarks | Images

methods2test

methods2test is a supervised dataset consisting of Test Cases and their corresponding Focal Methods from a set of Java software repositories. It was constructed by parsing the Java projects to obtain classes and methods with their associated metadata. Next, each Test Class was matched to its corresponding Focal Class. Finally, each Test Case within a Test Class was mapped to the related Focal Method to obtain a set of Mapped Test Cases.

11 papers | 0 benchmarks | Texts

MPI3D Disentanglement

A dataset that consists of over one million images of physical 3D objects with seven factors of variation, such as object color, shape, size and position.

11 papers | 0 benchmarks | Images

NCBI Datasets

NCBI Datasets is a valuable resource that simplifies the process of gathering data from various NCBI databases. Whether you’re a researcher, scientist, or bioinformatician, NCBI Datasets provides an efficient way to access sequence information, annotations, and metadata for genes and genomes.

11 papers | 0 benchmarks

NewSHead

The NewSHead dataset contains 369,940 English stories with 932,571 unique URLs, of which 359,940 stories are for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.

11 papers | 1 benchmark

ParCorFull (Parallel Corpus Annotated with Full Coreference)

ParCorFull is a parallel corpus annotated with full coreference chains, created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face: the translation of coreference across languages. The corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. The corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases.

11 papers | 0 benchmarks | Texts

Perspectrum

Perspectrum is a dataset of claims, perspectives and evidence. The initial data was collected from online debate websites and then augmented using search engines to expand and diversify the dataset. Crowd-sourcing was used to filter out noise and ensure high-quality data. The dataset contains 1k claims, accompanied by pools of 10k perspective sentences and 8k evidence paragraphs.

11 papers | 1 benchmark

PhotoBook

A large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation.

11 papers | 0 benchmarks | Images, Texts

ProtoQA

ProtoQA is a question answering dataset for training and evaluating the common sense reasoning capabilities of artificial intelligence systems in prototypical situations. The training set is gathered from an existing set of questions played in the long-running international game show FAMILY-FEUD. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers.

11 papers | 0 benchmarks | Texts
