19,997 machine learning datasets
Multimodal Lecture Presentations (MLP) is a large-scale benchmark dataset for testing the capabilities of machine learning models in multimodal understanding of educational content. To benchmark the understanding of multimodal information in lecture slides, two research tasks are introduced; they are designed to be a first step towards developing AI that can explain and illustrate lecture slides: automatic retrieval of (1) spoken explanations for an educational figure (Figure-to-Text) and (2) illustrations to accompany a spoken explanation (Text-to-Figure).
CodeQueries Benchmark dataset consists of instances of semantic queries, code context and code spans in the context corresponding to the semantic queries. The dataset can be used in experiments involving semantic query comprehension with an extractive question-answering methodology over code. More details can be found in the paper.
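The extractive question-answering setup described for CodeQueries can be pictured with a minimal sketch. The record layout below (the `QueryInstance` class, its field names, and the use of character offsets for spans) is a hypothetical illustration, not the dataset's actual schema; consult the paper for the real format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueryInstance:
    """Hypothetical record: a semantic query, a code context, and answer spans."""
    query: str
    context: str
    # Each span is a (start, end) pair of character offsets into `context`.
    answer_spans: List[Tuple[int, int]]


def extract_spans(instance: QueryInstance) -> List[str]:
    """Return the substrings of the code context selected by the answer spans."""
    return [instance.context[start:end] for start, end in instance.answer_spans]


# Made-up example: the query asks for an unused import in a small snippet.
context = "import os\nx = 1\nprint(x)\n"
start = context.index("import os")
example = QueryInstance(
    query="Unused import",
    context=context,
    answer_spans=[(start, start + len("import os"))],
)

print(extract_spans(example))
```

In an extractive setup of this kind, a model only has to point at spans that already exist in the context, so predictions can be scored by comparing offsets rather than generated text.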
Texture-based studies and designs have been in focus recently. Whisker-based multidimensional surface texture data is missing in the literature. This data is critical for robotics and machine perception algorithms in the classification and regression of textural surfaces. We present a novel sensor design to acquire multidimensional texture information. The surface texture's roughness and hardness were measured experimentally using sweeping and dabbing. The data is made available to the research community for further advancing texture perception studies.
MIDGARD is an open-source simulator for autonomous robot navigation in outdoor unstructured environments. It is designed to enable the training of autonomous agents (e.g., unmanned ground vehicles) in photorealistic 3D environments, and support the generalization skills of learning-based agents thanks to the variability in training scenarios.
A set of 221 stereo videos captured by the SOCRATES stereo camera trap in a wildlife park in Bonn, Germany between February and July of 2022. A subset of frames is labeled with instance annotations in the COCO format.
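As a rough sketch of how instance annotations in the COCO format are typically consumed, the snippet below groups annotations by image using only the standard library. The file name, category, and box values are invented placeholders and do not come from the SOCRATES data; only the top-level keys (`images`, `categories`, `annotations`) and the `[x, y, width, height]` box layout follow the COCO convention.

```python
import json
from collections import defaultdict

# In-memory stand-in for a COCO-style annotation file (all values are made up).
coco_text = json.dumps({
    "images": [{"id": 1, "file_name": "frame_0001.png", "width": 1920, "height": 1080}],
    "categories": [{"id": 1, "name": "animal"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 200.0, 50.0, 80.0], "area": 4000.0, "iscrowd": 0},
        {"id": 11, "image_id": 1, "category_id": 1,
         "bbox": [400.0, 300.0, 60.0, 90.0], "area": 5400.0, "iscrowd": 0},
    ],
})


def group_annotations_by_image(coco: dict) -> dict:
    """Map each image id to the list of annotations that refer to it."""
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return dict(by_image)


coco = json.loads(coco_text)
by_image = group_annotations_by_image(coco)
category_names = {c["id"]: c["name"] for c in coco["categories"]}

for image in coco["images"]:
    for ann in by_image.get(image["id"], []):
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        print(image["file_name"], category_names[ann["category_id"]], (x, y, w, h))
```

For a real annotation file, `json.loads(coco_text)` would be replaced by `json.load` on the opened file.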
IHDS is a nationally representative, multi-topic panel survey of 41,554 households in 1,503 villages and 971 urban neighborhoods across India.
InLegalNER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
The Gun Violence Corpus (GVC) consists of 241 unique incidents for which we have structured data on a) location, b) time, c) the name, gender and age of the victims, and d) the status of the victims after the incident: killed or injured. For these data, 510 news articles were gathered following the 'data to text' approach. The structured data and articles report on a variety of gun violence incidents, such as drive-by shootings, murder-suicides, hunting accidents, and involuntary gun discharges. The documents have been manually annotated for all mentions that refer to the gun violence incident at hand.
DifferSketching is a dataset of freehand sketches to understand how differently professional and novice users sketch 3D objects. It includes 3,620 freehand multi-view sketches registered with their corresponding 3D objects. To date, the dataset is an order of magnitude larger than the existing datasets.
The data used in this research is a subset of the Multi-parameter Intelligent Monitoring for Critical Care (MIMIC) II database. It contains minute-by-minute time series of Heart Rate (HR), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), and Mean Arterial blood Pressure (MAP) arranged into records, each of which corresponds to an adult patient’s ICU stay.
Natural Language Inference processes pairs of sentences to determine the semantic relation between them. Each sentence pair is annotated with one of three classes (Contradiction, Entailment, Neutral).
A dataset of 53 complex-valued signal modulation classes.
Music4All-Onion is a large-scale, multi-modal music dataset that expands the Music4All dataset by including 26 additional audio, video, and metadata features for 109,269 music pieces. It also provides a set of 252,984,396 listening records of 119,140 users, extracted from the online music platform Last.fm.
The resources for this dataset can be found at https://www.openml.org/d/182
ImDrug is a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It features modularized components including formulation of learning setting and tasks, dataset curation, standardized evaluation, and baseline algorithms. It also provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis.
Dataset page: https://github.com/mosamdabhi/MBW-Data
This is a large-scale dataset of quantum-mechanically calculated properties (DFT level) of crystalline materials for graph representation learning that contains approximately 900k entries (OQM9HK). This dataset is constructed on the basis of the Open Quantum Materials Database (OQMD) v1.5 containing more than one million entries, and is the successor to the OQMD v1.2 dataset containing approximately 600k entries (OQM6HK).
NSVA is a large-scale NBA dataset for Sports Video Analysis with a focus on sports video captioning. The dataset consists of more than 32K video clips and is also designed to address two additional tasks: fine-grained sports action recognition and salient player identification.
PAL4Inpaint is a dataset consisting of 4,795 inpainting results with per-pixel perceptual artifacts annotations designed for image inpainting tasks.
FormulaNet is a new large-scale Mathematical Formula Detection dataset. It consists of 46,672 pages of STEM documents from arXiv and has 13 types of labels. The dataset is split into a train set of 44,338 pages and a validation set of 2,334 pages. For copyright reasons, we can only provide the list of papers, which must be downloaded and processed.