Datasets

19,997 machine learning datasets


Musk v2

The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.

4 papers | 2 benchmarks | Tabular

Middlebury 2001

Middlebury 2001 is a stereo dataset of indoor scenes with multiple handcrafted layouts.

4 papers | 0 benchmarks | Images, Stereo

DukeMTMC-attribute

The images in the DukeMTMC-attribute dataset come from Duke University. The dataset has 1812 identities and 34183 annotated bounding boxes: 702 identities for training and 1110 identities for testing, corresponding to 16522 and 17661 images respectively. Attributes are annotated at the identity level, and every image in the dataset is annotated with 23 attributes.

4 papers | 2 benchmarks | Images, Texts, Videos

DSTC7 Task 2 (Dialog System Technology Challenges Task 2)

DSTC7 Task 2 is a dataset and task for end-to-end conversation modeling. The goal is to generate conversational responses that go beyond trivial chitchat by injecting informative content grounded in external knowledge. The data consists of conversations from Reddit and contextually relevant “facts” taken from the website that started each Reddit conversation. That is, the setup is grounded: each conversation in the data is about a specific web page that was linked at the start of the conversation.

4 papers | 0 benchmarks | Texts

Bach Doodle

The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted through the Bach Doodle. It contains metadata about each composition (such as the country of origin and feedback), as well as a MIDI file of the user-entered melody and a MIDI file of the generated harmonization. The dataset contains about 6 years of user-entered music.

4 papers | 0 benchmarks | Audio

NIPS4Bplus

NIPS4Bplus is a richly annotated birdsong audio dataset comprising recordings of bird vocalisations, their active species tags, and the temporal annotations acquired for them. It consists of around 687 recordings covering 87 classes. The total duration of the audio is around 1 hour.

4 papers | 0 benchmarks | Audio

BCCD

BCCD is a small-scale dataset for blood cell detection.

4 papers | 0 benchmarks | Images

AndroidHowTo

AndroidHowTo contains 32,436 data points from 9,893 unique How-To instructions and is split into training (8K), validation (1K) and test (900) sets. All test examples have perfect agreement across all three annotators for the entire sequence. In total, there are 190K operation spans, 172K object spans, and 321 input spans labeled. The instructions range in length from 19 to 85 tokens, with a median of 59. They describe sequences of 1 to 19 action steps, with a median of 5.

4 papers | 0 benchmarks | Texts

CODA-19

CODA-19 is a human-annotated dataset that denotes the Background, Purpose, Method, Finding/Contribution, and Other for 10,966 English abstracts in the COVID-19 Open Research Dataset.

4 papers | 0 benchmarks | Medical, Texts

RWWD (Real World Worry Dataset)

The Real World Worry Dataset (RWWD) captures the emotional responses of UK residents to COVID-19 at a point in time when the COVID-19 situation affected the lives of all individuals in the UK. The data were collected on the 6th and 7th of April 2020, a time at which the UK was under lockdown (news, 2020) and death tolls were increasing. On April 6, 5,373 people in the UK had died of the virus, and 51,608 had tested positive. On the day before data collection, the Queen addressed the nation via a television broadcast. It was also announced that Prime Minister Boris Johnson had been admitted to intensive care in a hospital for COVID-19 symptoms.

4 papers | 0 benchmarks | Texts

Microsoft Research Multimodal Aligned Recipe Corpus

To construct the Microsoft Research Multimodal Aligned Recipe Corpus, the authors first extract a large number of text and video recipes from the web. The goal is to find joint alignments between multiple text recipes and multiple video recipes for the same dish. The task is challenging, as different recipes vary in their order of instructions and use of ingredients. Moreover, video instructions can be noisy, and text and video instructions include different levels of specificity in their descriptions.

4 papers | 0 benchmarks | Texts

MultiSense

MultiSense is a dataset of 9,504 images, each annotated with an English verb and its translations into Spanish and German.

4 papers | 0 benchmarks | Images, Texts

PASTEL

PASTEL is a parallelly annotated stylistic language dataset. The dataset consists of ~41K parallel sentences and 8.3K parallel stories annotated across different personas.

4 papers | 0 benchmarks | Texts

OLPBENCH

OLPBENCH is a large Open Link Prediction benchmark, which was derived from the state-of-the-art Open Information Extraction corpus OPIEC (Gashteovski et al., 2019). OLPBENCH contains 30M open triples, 1M distinct open relations and 2.5M distinct mentions of approximately 800K entities.

4 papers | 0 benchmarks | Texts

STAIR Captions

STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.

4 papers | 0 benchmarks | Images, Texts

PHD² (Personalized Highlight Detection Dataset)

The dataset records which video segments a specific user considers a highlight. This kind of data allows for strong personalization models, as specific examples of what a user is interested in help models obtain a fine-grained understanding of that user.

4 papers | 0 benchmarks | Videos

3DNet

The 3DNet dataset is a free resource for object class recognition and 6DOF pose estimation from point cloud data. 3DNet provides large-scale hierarchical CAD-model databases of increasing size and difficulty, with 10, 60 and 200 object classes, together with evaluation datasets that contain thousands of scenes captured with an RGB-D sensor.

4 papers | 0 benchmarks | Images

Lorenz Dataset

The Lorenz dataset contains 100,000 time series of length 24. The data has 5 modes, obtained using the Lorenz equations with 5 different seed values.

4 papers | 0 benchmarks | Time series
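
As a rough illustration of the Lorenz entry above, the sketch below produces one length-24 series per seed by integrating the Lorenz equations. Everything beyond "length 24" and "5 seeds" is an assumption made for illustration: the classic parameters (sigma = 10, rho = 28, beta = 8/3), the forward Euler step `dt`, the reading of "seed" as a seeded random initial state, the choice to record the x coordinate, and the hypothetical helper name `lorenz_series`. It is not the procedure the dataset authors used.

```python
import random


def lorenz_series(seed, length=24, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Return `length` x-values of a Lorenz trajectory (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: the seed picks a random initial state in [-1, 1]^3.
    x, y, z = (rng.uniform(-1.0, 1.0) for _ in range(3))
    series = []
    for _ in range(length):
        # Lorenz equations: dx/dt, dy/dt, dz/dt.
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        # One forward Euler step.
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        series.append(x)
    return series


# One example series per seed; the dataset itself holds 100,000 series.
modes = {seed: lorenz_series(seed) for seed in range(5)}
```

Each entry of `modes` is a list of 24 floats, and calling `lorenz_series` again with the same seed reproduces the same series.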

CSAW-S

CSAW-S is a dataset of mammography images which includes expert annotations of tumors and non-expert annotations of breast anatomy and artifacts in the image.

4 papers | 0 benchmarks | Images

FIGR-8

The FIGR-8 database is a dataset of 1,548,256 images in 17,375 classes, representing pictograms, ideograms, icons, emoticons, and depictions of objects or concepts. Its aim is to set a benchmark for few-shot image generation tasks, although it is not limited to them. Each image is 192x192 pixels with grayscale values of 0-255. Classes are not balanced (they do not all contain the same number of elements), but each contains at least 8 images.

4 papers | 0 benchmarks | Images
