19,997 machine learning datasets
19,997 dataset results
CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 acoustic scenes.
Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.
KITTI360-EX is a dataset for outer- and inner FoV expansion. It contains 76k pinhole images as well as 76k spherical images and is used for beyond-FoV estimation.
HyperRED is a dataset for the new task of hyper-relational extraction, which extracts relation triplets together with qualifier information such as time, quantity or location. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). HyperRED contains 44k sentences with 62 relation types and 44 qualifier types.
Qualitative dataset with real blurred videos, created by using beam-splitter setup in lab environment
CEFR-SP contains 17k English sentences annotated with the levels based on the Common European Framework of Reference for Languages assigned by English-education professionals.
Don’t Patronize Me! (DPM) is an annotated dataset with Patronizing and Condescending Language towards vulnerable communities.
FPv1 (prior name FAUST-partial) is a 3D registration benchmark dataset created to address the lack of data variability in the existing 3D registration benchmarks such as: 3DMatch, ETH, KITTI.
Overview LEVEN is the largest Legal Event Detection dataset as well as the largest Chinese Event Detection dataset.
The MMBody dataset provides human body data with motion capture, GT mesh, Kinect RGBD, and millimeter wave sensor data. See homepage for more details.
Human Bodies in the Wild (HBW) is a validation and test set for body shape estimation. It consists of images taken in the wild and ground truth 3D body scans in SMPL-X topology. To create HBW, we collect body scans of 35 participants and register the SMPL-X model to the scans. Further each participant is photographed in various outfits and poses in front of a white background and uploads full-body photos of themselves taken in the wild. The validation and test set images are released. The ground truth shape is only released for the validation set.
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese.
A chemical synthesis route dataset constructed from the USPTO reaction dataset (1976-Sep2016) and a list of commercially available building blocks from eMolecules (~23.1M molecules). After processing, the dataset has 299202 training routes, 65274 validation routes, 190 test routes, and the corresponding target molecules.
GeoDE is a geographically diverse dataset with 61,940 images from 40 classes and 6 world regions, and no personally identifiable information, collected through crowd-sourcing.
Aachen Day-Night v1.1 dataset is an extended version of the original Aachen Day-Night dataset. Besides the original query images, the Aachen Day-Night v1.1 dataset contains an additional 93 nighttime queries. In addition, it uses a larger 3D model containing additional images. These additional images were extracted from video sequences captured with different cameras. Please refer to Reference Pose Generation for Long-term Visual Localization via Learned Features and View Synthesis for more information.
OMMO is a new benchmark for several outdoor NeRF-based tasks, such as novel view synthesis, surface reconstruction, and multi-modal NeRF. It contains complex objects and scenes with calibrated images, point clouds and prompt annotations.
BANDON is a dataset for building change detection with off-nadir aerial images dataset, which is composed of off-Nadir image pairs of urban and rural areas. Overall, the BANDON dataset contains 2283 pairs of images, 2283 change labels,1891 BT-flows labels, 1891 pairs of segmentation labels, and 1891 pair of ST-offsets labels (test sets do not provide auxiliary annotations).
GHOSTS is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. It a collection of multiple datasets of prompts, totalling 728 prompts, for which ChatGPT’s output was manually rated by experts.
The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert linguistic annotators. Relabeling focuses on the 5K most challenging instances in dev and test, in total, 51.2% of these are corrected. Published at ACL 2020.
A multivariate spatio-temporal benchmark dataset for meteorological forecasting based on real-time observation data from ground weather stations.