19,997 machine learning datasets
The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.
Middlebury 2001 is a stereo dataset of indoor scenes with multiple handcrafted layouts.
The images in the DukeMTMC-attribute dataset come from Duke University. There are 1812 identities and 34183 annotated bounding boxes in the DukeMTMC-attribute dataset. This dataset contains 702 identities for training and 1110 identities for testing, corresponding to 16522 and 17661 images respectively. The attributes are annotated at the identity level, and every image in this dataset is annotated with 23 attributes.
DSTC Task 2 is a dataset and task for end-to-end conversation modeling. The goal is to generate conversational responses that go beyond trivial chitchat by injecting informative responses that are grounded in external knowledge. The data consists of conversational data from Reddit, and contextually-relevant “facts” taken from the website that started the Reddit conversation. That is, the setup is grounded, as each conversation in the data is about a specific web page that was linked at the start of the conversation.
The Bach Doodle Dataset is composed of 21.6 million harmonizations submitted from the Bach Doodle. The dataset contains both metadata about the composition (such as the country of origin and feedback), as well as a MIDI of the user-entered melody and a MIDI of the generated harmonization. The dataset contains about 6 years of user-entered music.
NIPS4Bplus is a richly annotated birdsong audio dataset comprising recordings of bird vocalisations along with their active species tags and the temporal annotations acquired for them. It consists of around 687 recordings covering 87 classes, with species tags and temporal annotations. The total duration of audio is around 1 hour.
BCCD is a small-scale dataset for blood cell detection.
AndroidHowTo contains 32,436 data points from 9,893 unique How-To instructions and is split into training (8K), validation (1K) and test (900) sets. All test examples have perfect agreement across all three annotators for the entire sequence. In total, there are 190K operation spans, 172K object spans, and 321 input spans labeled. The lengths of the instructions range from 19 to 85 tokens, with a median of 59. They describe a sequence of actions from one to 19 steps, with a median of 5.
CODA-19 is a human-annotated dataset that denotes the Background, Purpose, Method, Finding/Contribution, and Other for 10,966 English abstracts in the COVID-19 Open Research Dataset.
The Real World Worry Dataset (RWWD) captures the emotional responses of UK residents to COVID-19 at a time when the COVID-19 situation affected the lives of all individuals in the UK. The data were collected on the 6th and 7th of April 2020, a time at which the UK was under lockdown (news, 2020), and death tolls were increasing. On April 6, 5,373 people in the UK had died of the virus, and 51,608 had tested positive. On the day before data collection, the Queen addressed the nation via a television broadcast. Furthermore, it was also announced that Prime Minister Boris Johnson had been admitted to intensive care in a hospital for COVID-19 symptoms.
To construct the MICROSOFT RESEARCH MULTIMODAL ALIGNED RECIPE CORPUS the authors first extract a large number of text and video recipes from the web. The goal is to find joint alignments between multiple text recipes and multiple video recipes for the same dish. The task is challenging, as different recipes vary in their order of instructions and use of ingredients. Moreover, video instructions can be noisy, and text and video instructions include different levels of specificity in their descriptions.
MultiSense is a dataset of 9,504 images annotated with an English verb and its translation in Spanish and German.
PASTEL is a parallelly annotated stylistic language dataset. The dataset consists of ~41K parallel sentences and 8.3K parallel stories annotated across different personas.
OLPBENCH is a large Open Link Prediction benchmark, which was derived from the state-of-the-art Open Information Extraction corpus OPIEC (Gashteovski et al., 2019). OLPBENCH contains 30M open triples, 1M distinct open relations and 2.5M distinct mentions of approximately 800K entities.
STAIR Captions is a large-scale dataset containing 820,310 Japanese captions. This dataset can be used for caption generation, multimodal retrieval, and image generation.
The dataset contains information on which video segments a specific user considers a highlight. Having this kind of data allows for strong personalization models, as specific examples of what a user is interested in help models obtain a fine-grained understanding of that specific user.
The 3DNet dataset is a free resource for object class recognition and 6DOF pose estimation from point cloud data. 3DNet provides large-scale hierarchical CAD-model databases of increasing size and difficulty, with 10, 60 and 200 object classes, together with evaluation datasets that contain thousands of scenes captured with an RGB-D sensor.
The Lorenz dataset contains 100,000 time series of length 24. The data has 5 modes and is obtained using the Lorenz equations with 5 different seed values.
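As a rough illustration of how such data could be produced, the sketch below integrates the Lorenz system with a classical fourth-order Runge-Kutta step and records the x-coordinate for 24 samples. This is not the dataset authors' generation script. The parameter values (sigma=10, rho=28, beta=8/3), the step size `dt`, the reading of "seed value" as a random seed that fixes the initial state, and all function names (`lorenz_step`, `make_series`, `make_dataset`) are assumptions made for this example.

```python
import random


def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one RK4 step (parameters are assumed)."""

    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shift(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shift(state, k1, dt / 2))
    k3 = deriv(shift(state, k2, dt / 2))
    k4 = deriv(shift(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def make_series(seed, length=24, dt=0.01):
    """Return `length` x-coordinate samples of one trajectory.

    Assumption: the seed picks a random initial state in [-1, 1]^3.
    """
    rng = random.Random(seed)
    state = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
    xs = []
    for _ in range(length):
        xs.append(state[0])
        state = lorenz_step(state, dt)
    return xs


def make_dataset(n_series=10, seeds=(0, 1, 2, 3, 4), length=24):
    """Build (mode, series) pairs, cycling through the seeds as modes."""
    data = []
    for i in range(n_series):
        mode = i % len(seeds)
        # Derive a distinct RNG seed per series so samples in a mode differ.
        data.append((mode, make_series(seeds[mode] * 1_000_003 + i, length)))
    return data
```

Calling `make_dataset(n_series=100000)` would yield a collection of the same shape as the one described, though the values will not match the published data.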
CSAW-S is a dataset of mammography images which includes expert annotations of tumors and non-expert annotations of breast anatomy and artifacts in the image.
The FIGR-8 database is a dataset containing 1,548,256 images in 17,375 classes, representing pictograms, ideograms, icons, emoticons, or depictions of objects or concepts. Its aim is to set a benchmark for Few-shot Image Generation tasks, although it is not limited to them. Each image is 192x192 pixels with grayscale values of 0-255. Classes are not balanced (they do not all contain the same number of elements), but each contains at least 8 images.