19,997 machine learning datasets
We collect a large-scale synthetic dataset for robotic hands with Differentiable Force Closure (DFC). It covers 436,000 diverse and stable grasps for 58 household objects from the ContactDB and YCB datasets across 5 robotic hands: EZGripper, Barrett Hand, Robotiq-3Finger, Allegro Hand and Shadowhand.
DIBCO 2017 is the International Competition on Document Image Binarization, organized in conjunction with the ICDAR 2017 conference. The general objective of the contest is to identify current advances in the binarization of machine-printed and handwritten document images, using performance evaluation measures that are motivated by document image analysis and recognition requirements.
H-DIBCO 2014 is the International Document Image Binarization Competition dedicated to handwritten document images, organized in conjunction with the ICFHR 2014 conference. The objective of the contest is to identify current advances in handwritten document image binarization using meaningful evaluation performance measures.
H-DIBCO 2018 is the International Handwritten Document Image Binarization Contest, organized in the context of the ICFHR 2018 conference. The general objective of the contest is to record recent advances in document image binarization using established evaluation performance measures.
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
QDax is a benchmark suite designed for Deep Neuroevolution in Reinforcement Learning domains for robot control. The suite includes the definition of tasks, environments, behavioral descriptors, and fitness. It specifies different benchmarks based on the complexity of both the task and the agent controlled by a deep neural network. The benchmark uses standard Quality-Diversity metrics, including coverage, QD-score, maximum fitness, and an archive profile metric to quantify the relation between coverage and fitness.
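The first three metrics named in the QDax entry can be illustrated with a minimal Python sketch. It assumes the commonly used definitions (coverage as the fraction of filled archive cells, QD-score as the sum of fitness over filled cells, maximum fitness as the best value in the archive) and a hypothetical archive represented as a flat list with `None` for empty cells. It does not use the QDax API, and the archive profile metric is not covered.

```python
from typing import Dict, Optional, Sequence


def qd_metrics(archive: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Compute coverage, QD-score and maximum fitness for a flat archive.

    `archive` holds one entry per cell of the behavioral-descriptor grid:
    the fitness of the elite stored in that cell, or None if the cell is empty.
    This layout is an assumption made for illustration.
    """
    filled = [fitness for fitness in archive if fitness is not None]
    # Coverage: fraction of cells that contain a solution.
    coverage = len(filled) / len(archive) if archive else 0.0
    # QD-score: sum of the fitness values of all stored solutions.
    qd_score = float(sum(filled))
    # Maximum fitness: best solution in the archive (None if it is empty).
    max_fitness = max(filled) if filled else None
    return {"coverage": coverage, "qd_score": qd_score, "max_fitness": max_fitness}


# Hypothetical four-cell archive with two filled cells.
metrics = qd_metrics([None, 1.0, 3.0, None])
```

Summing raw fitness values only gives a meaningful QD-score when fitness is non-negative; benchmarks with negative rewards typically add an offset first, which this sketch leaves out.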
The Malimg Dataset contains 9,339 malware byteplot images from 25 different families.
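A byteplot renders a binary file as a grayscale image by treating each byte as one pixel intensity (0 to 255). The sketch below shows the general idea in plain Python. The `byteplot` function, its `width` parameter and the zero padding of the last row are assumptions for illustration, not the preprocessing used to build Malimg.

```python
from typing import List


def byteplot(data: bytes, width: int = 256) -> List[List[int]]:
    """Arrange raw bytes into rows of `width` grayscale pixels (0-255).

    The final row is padded with zeros (an assumption) so that every
    row has the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])  # each byte becomes one pixel
        row += [0] * (width - len(row))        # pad the last, shorter row
        rows.append(row)
    return rows


# Hypothetical five-byte input laid out two pixels per row.
image = byteplot(bytes([0, 127, 255, 64, 32]), width=2)
```

The resulting nested list can be passed to an imaging library of choice to save or display the picture.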
legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
CELLS is a large (63k pairs) and broad-ranging (12 journals) parallel corpus for lay language generation. Each abstract and its corresponding lay language summary are written by domain experts, assuring the quality of the dataset.
MCXFace is a heterogeneous face recognition dataset consisting of multi-channel image samples for 51 subjects. For each subject, color (RGB), thermal, near-infrared (850 nm), short-wave infrared (1300 nm), depth, stereo depth, and depth estimated from RGB images are available. Overall, 7406 images together with landmark annotations and standard protocols are available in this dataset.
KAMEL comprises knowledge about 234 relations from Wikidata with large training, validation, and test sets. We make sure that all facts are also present in Wikipedia so that they have been seen during the pre-training procedure of the LMs we are probing. Most importantly, we overcome the limitations of existing probing datasets by (1) covering a larger variety of knowledge graph relations, (2) including single- and multi-token entities, (3) using relations with literals, and (4) providing alternative labels for entities. Furthermore, we (5) created an evaluation procedure for higher cardinality relations, which was missing in previous works, and (6) made sure that the dataset can be used for causal LMs.
Smart meter roll-outs provide easy access to granular meter measurements, enabling advanced energy services ranging from demand response measures and tailored energy feedback to smart home/building automation. To design such services and to train and validate models, access to data that resembles what is expected of smart meters, collected in a real-world setting, is necessary. The REFIT electrical load measurements dataset described in this paper includes whole house aggregate loads and nine individual appliance measurements at 8-second intervals per house, collected continuously over a period of two years from 20 houses. During monitoring, the occupants were conducting their usual routines. At the time of publishing, the dataset has the largest number of houses monitored in the United Kingdom at less than 1-minute intervals over a period greater than one year. The dataset comprises 1,194,958,790 readings that represent over 250,000 monitored appliance uses. The data is accessible in an eas
ArmanEmo is a human-labeled emotion dataset of more than 7000 Persian sentences labeled for seven categories. The dataset has been collected from different resources, including Twitter, Instagram, and Digikala (an Iranian e-commerce company) comments. Labels are based on Ekman's six basic emotions (Anger, Fear, Happiness, Hatred, Sadness, Wonder) and another category (Other) to consider any other emotion not included in Ekman's model.
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M images includes diverse annotation types.
Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning.
DRTiD is a benchmark dataset for diabetic retinopathy (DR) grading, consisting of 3,100 two-field fundus images.
Plain Language Adaptation of Biomedical Abstracts (PLABA) is a dataset designed for automatic adaptation that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts, totaling 7643 sentence pairs.
DocCVQA is a Document Visual Question Answering dataset in which the questions are posed over a whole collection of 14,362 scanned documents. The task can therefore be seen as a retrieval-style evidence-seeking task: given a question, the aim is to identify and retrieve all the documents in the collection that are relevant to answering it, as well as to provide the answer.
The SWIMSEG dataset contains 1013 images of sky/cloud patches, along with their corresponding binary segmentation maps. The ground truth annotation was done in consultation with experts from Singapore Meteorological Services. All images were captured in Singapore using WAHRSIS, a calibrated ground-based whole sky imager, over a period of 22 months from October 2013 to July 2015. Each patch covers about 60-70 degrees of the sky with a resolution of 600x600 pixels.