19,997 machine learning datasets
19,997 dataset results
EBM-NLP annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in clinical trial abstracts. The corresponding PICO Extraction task aims to identify the spans in clinical trial abstracts that describe the respective PICO elements.
The MAFL dataset contains manually annotated facial landmark locations for 19,000 training and 1,000 test images.
DialogRE is the first human-annotated dialogue-based relation extraction dataset, containing 1,788 dialogues originating from the complete transcripts of a famous American television situation comedy Friends. The are annotations for all occurrences of 36 possible relation types that exist between an argument pair in a dialogue. DialogRE is available in English and Chinese.
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.
stocknet-dataset This repository releases a comprehensive dataset for stock movement prediction from tweets and historical stock prices. Please cite the following paper [bib] if you use this dataset,
The MSLR-WEB10K dataset consists of 10,000 search queries over the documents from search results. The data also contains the values of 136 features and a corresponding user-labeled relevance factor on a scale of one to five with respect to each query-document pair. It is a subset of the MSLR-WEB30K dataset.
The Lakh MIDI dataset is a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. Its goal is to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files). Around 10% of all MIDI files include timestamped lyrics events with lyrics are often transcribed at the word, syllable or character level.
The General Video Game AI (GVGAI) framework is widely used in research which features a corpus of over 100 single-player games and 60 two-player games. These are fairly small games, each focusing on specific mechanics or skills the players should be able to demonstrate, including clones of classic arcade games such as Space Invaders, puzzle games like Sokoban, adventure games like Zelda or game-theory problems such as the Iterative Prisoners Dilemma. All games are real-time and require players to make decisions in only 40ms at every game tick, although not all games explicitly reward or require fast reactions; in fact, some of the best game-playing approaches add up the time in the beginning of the game to run Breadth-First Search in puzzle games in order to find an accurate solution. However, given the large variety of games (many of which are stochastic and difficult to predict accurately), scoring systems and termination conditions, all unknown to the players, highly-adaptive genera
PanLex translates words in thousands of languages. Its database is panlingual (emphasizes coverage of every language) and lexical (focuses on words, not sentences).
The color FERET database is a dataset for face recognition. It contains 11,338 color images of size 512×768 pixels captured in a semi-controlled environment with 13 different poses from 994 subjects.
BRATS 2013 is a brain tumor segmentation dataset consists of synthetic and real images, where each of them is further divided into high-grade gliomas (HG) and low-grade gliomas (LG). There are 25 patients with both synthetic HG and LG images and 20 patients with real HG and 10 patients with real LG images. For each patient, FLAIR, T1, T2, and post-Gadolinium T1 magnetic resonance (MR) image sequences are available.
DRealSR establishes a Super Resolution (SR) benchmark with diverse real-world degradation processes, mitigating the limitations of conventional simulated image degradation.
Fashion-Gen consists of 293,008 high definition (1360 x 1360 pixels) fashion images paired with item descriptions provided by professional stylists. Each item is photographed from a variety of angles.
JTA is a dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios.
The Open Table-and-Text Question Answering (OTT-QA) dataset contains open questions which require retrieving tables and text from the web to answer. This dataset is re-annotated from the previous HybridQA dataset. The dataset is collected by UCSB NLP group and issued under MIT license.
Enables detailed human body model reconstruction in clothing from a single monocular RGB video without requiring a pre scanned template or manually clicked points.
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.
VisualMRC is a visual machine reading comprehension dataset that proposes a task: given a question and a document image, a model produces an abstractive answer.
A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from YCB dataset. The annotations are automatically obtained using an optimization algorithm. The hand pose annotations for the test set are withheld and the accuracy of the algorithms on the test set can be evaluated with standard metrics using the CodaLab challenge submission(see project page). The object pose annotations for the test and train set are provided along with the dataset.
3DSSG provides 3D semantic scene graphs for 3RScan. A semantic scene graph is defined by a set of tuples between nodes and edges where nodes represent specific 3D object instances in a 3D scan. Nodes are defined by its semantics, a hierarchy of classes as well as a set of attributes that describe the visual and physical appearance of the object instance and their affordances. The edges in our graphs are the semantic relationships (predicates) between the nodes such as standing on, hanging on, more comfortable than or same material.