19,997 machine learning datasets
19,997 dataset results
KRAUTS (Korpus of newspapeR Articles with Underlinded Temporal expressionS) is a German temporally annotated news corpus accompanied with TimeML annotation guidelines for German. It was developed at Fondazione Bruno Kessler, Trento, Italy and at the Max Planck Institute for Informatics, Saarbrücken, Germany. Our goal is to boost temporal tagging research for German.
TimeBankPT is a corpus of Portuguese text with annotations about time. The annotation scheme used is similar to TimeML. TimeBankPT is the result of adapting the English corpus used in the first TempEval challenge to the Portuguese language.
The MoCapAct dataset contains training data and models for humanoid locomotion research. It consists of expert policies that are trained to track individual clip snippets and HDF5 files of noisy rollouts collected from each expert, including proprioceptive observations and actions.
A dataset of real distributional shift across multiple large-scale tasks.
The first NER dataset in the field of traffic, which is to extract the characteristics and attributes of the vehicle on the road.
DRAW-1K is a dataset consisting of 1000 algebra word problems, semiautomatically annotated for the evaluation of automatic solvers. DRAW includes gold coefficient alignments that are necessary uniquely identify the derivation of an equation system.
GAMMA releases the world's first multi-modal dataset for glaucoma grading, which was provided by the Sun Yat-sen Ophthalmic Center of Sun Yat-sen University in Guangzhou, China. The dataset consists of 2D fundus images and 3D optical coherence tomography (OCT) images of 300 patients. The dataset was annotated with glaucoma grade in every sample, and macular fovea coordinates as well as optic disc/cup segmentation mask in the fundus image.
KAIST-Lane (K-Lane) is the world’s first and the largest public urban road and highway lane dataset for Lidar. K-Lane has more than 15K frames and contains annotations of up to six lanes under various road and traffic conditions, e.g., occluded roads of multiple occlusion levels, roads at day and night times, merging (converging and diverging) and curved lanes.
The MUAD dataset (Multiple Uncertainties for Autonomous Driving), consisting of 10,413 realistic synthetic images with diverse adverse weather conditions (night, fog, rain, snow), out-of-distribution objects, and annotations for semantic segmentation, depth estimation, object, and instance detection. Predictive uncertainty estimation is essential for the safe deployment of Deep Neural Networks in real-world autonomous systems and MUAD allows to a better assess the impact of different sources of uncertainty on model performance.
DPB-5L is a Multilingual KG dataset containing 5 KGs in English, French, Japanese, Greek, and Spanish. The dataset is used for the Knowledge Graph Completion and Entity Alignment task. DPB-5L (Greek) is a subset of DPB-5L with Greek KG.
FRMT is a dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of human translations of a few thousand English Wikipedia sentences into regional variants of Portuguese and Mandarin. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms.
Diamante is a novel and efficient framework consisting of a data collection strategy and a learning method to boost the performance of pre-trained dialogue models. Two kinds of human feedback are collected and leveraged in Diamante, including explicit demonstration and implicit preference. The Diamante dataset is publicly available at the LUGE platform.
These images were generated using Blender and IEE-Simulator with different head-poses, where the images are labelled according to nine classes (straight, turned bottom-left, turned left, turned top-left, turned bottom-right, turned right, turned top-right, reclined, looking up). The dataset contains 16,013 training images and 2,825 testing images, in addition to 4,700 images for improvements.
Separated COCO is automatically generated subsets of COCO val dataset, collecting separated objects for a large variety of categories in real images in a scalable manner, where target object segmentation mask is separated into distinct regions by the occluder.
ALTO is a vision-focused dataset for the development and benchmarking of Visual Place Recognition and Localization methods for Unmanned Aerial Vehicles. The dataset is composed of two long (approximately 150km and 260km) trajectories flown by a helicopter over Ohio and Pennsylvania, and it includes high precision GPS-INS ground truth location data, high precision accelerometer readings, laser altimeter readings, and RGB downward facing camera imagery.The dataset also comes with reference imagery over the flight paths, which makes this dataset suitable for VPR benchmarking and other tasks common in Localization, such as image registration and visual odometry.
DiSCQ is a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. This dataset is released to facilitate further research into realistic clinical Question Answering (QA) and Question Generation (QG).
CS1QA is a dataset for code-based question answering in the programming education domain. It consists of 9,237 question-answer pairs gathered from chat logs in an introductory programming class using Python, and 17,698 unannotated chat data with code.
PDEBench provides a diverse and comprehensive set of benchmarks for scientific machine learning, including challenging and realistic physical problems. The repository consists of the code used to generate the datasets, to upload and download the datasets from the data repository, as well as to train and evaluate different machine learning models as baseline. PDEBench features a much wider range of PDEs than existing benchmarks and included realistic and difficult problems (both forward and inverse), larger ready-to-use datasets comprising various initial and boundary conditions, and PDE parameters. Moreover, PDEBench was crated to make the source code extensible and we invite active participation to improve and extent the benchmark.
Machine-learning Data Set Prepared from NASA Solar Dynamics Observatory Mission data.
We have characterized 1000 human cancer cell lines and screened them with 100s of compounds. On this website, you will find drug response data and genomic markers of sensitivity.