19,997 machine learning datasets
Hansel is a human-annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities:
TAS-NIR is a VIS+NIR dataset of semantically annotated images in unstructured outdoor environments. It consists of 209 VIS+NIR image pairs with a fine-grained semantic segmentation.
JEMMA (an Extensible Java Dataset for ML4Code Applications) is a large-scale dataset targeted at ML4Code. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods.
SPARF is a large-scale ShapeNet-based synthetic dataset for novel view synthesis consisting of ~17 million images rendered from nearly 40,000 shapes at high resolution (400×400 pixels).
The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data thus enables a variety of ML studies on vehicle-to-anything (V2X) communication.
PropSegmEnt is a corpus of over 35K propositions annotated by expert human raters. The dataset structure reflects two tasks: (1) segmenting the sentences within a document into a set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically aligned document, i.e. a document describing the same event or entity.
Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data, more than 300 000 audio recordings with Russian speech and their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.
It consists of an extensive, high-quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), Telugu (te), and a monolingual dataset in English (en). This is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The train and validation splits are created using distant supervision methods, and the test data is generated through human annotation.
MAUD is an expert-annotated merger agreement reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points study, where lawyers and law students answered 92 questions about 152 merger agreements.
The Argoverse 2 Map Change Dataset is a collection of 1,000 scenarios with ring camera imagery, lidar, and HD maps. Two hundred of the scenarios include changes in the real-world environment that are not yet reflected in the HD map, such as new crosswalks or repainted lanes. By sharing a map dataset that labels the instances in which there are discrepancies with sensor data, we encourage the development of novel methods for detecting out-of-date map regions.
Causal Triplet is a causal representation learning benchmark featuring not only visually more complex scenes, but also two crucial desiderata commonly overlooked in previous works:
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
Binary labels for Validity and Novelty respectively are given for each Conclusion.
High-gamma dataset described in Schirrmeister et al. 2017
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations. We also provide the pre-trained baseline models for two tasks, audio-visual speech recognition and lip reading.
Trinity Gesture Dataset includes 23 takes, totalling 244 minutes of motion capture and audio of a male native English speaker producing spontaneous speech on different topics. The actor’s motion was captured with 20 Vicon cameras at 59.94 frames per second (fps), and the skeleton includes 69 joints.
Assessing the value of energy efficiency improvements can be challenging as there's no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower-cost financing.
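As a minimal sketch of the comparison described above, the snippet below subtracts metered post-retrofit consumption from the counterfactual model's predictions for the original building. The function name `retrofit_savings` and all numbers are hypothetical and purely illustrative; they are not taken from the dataset, and the counterfactual model itself is assumed to exist elsewhere.

```python
def retrofit_savings(modeled_baseline_kwh, metered_kwh):
    """Per-period savings: modeled (counterfactual) use minus metered use.

    Both arguments are sequences of energy readings for the same periods.
    """
    if len(modeled_baseline_kwh) != len(metered_kwh):
        raise ValueError("baseline and metered series must have equal length")
    return [b - m for b, m in zip(modeled_baseline_kwh, metered_kwh)]


# Hypothetical values: what the model says the original building would have
# used, and what the retrofitted building actually used, per period.
baseline = [120.0, 115.0, 130.0]
metered = [100.0, 98.0, 105.0]

savings = retrofit_savings(baseline, metered)
total_savings = sum(savings)
print(savings, total_savings)
```

Because the savings figure is a difference against a modeled value, any error in the counterfactual model carries directly into the estimated savings, which is why more accurate models matter here.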
ATUE is an antibody study benchmark with four real-world supervised tasks covering therapeutic antibody engineering, B cell analysis, and antibody discovery.
FaceOcc is a high-quality face occlusion dataset which contains all mislabeled occlusions in CelebAMask-HQ and complements some occlusions and textures from the internet. The occlusion types cover sunglasses, spectacles, hands, masks, scarfs, microphones, etc.
The Copiale Cipher is a 105-page manuscript containing around 75 000 characters in total. Beautifully bound in green and gold brocade paper and written on high-quality paper with two different watermarks, the manuscript can be dated back to around 1750. Apart from what is obviously an owner's mark (“Philipp 1866”) and a note at the end of the last page (“Copiales 3”), the manuscript is completely encoded. The cipher employed consists of 100 different symbols, ranging from Latin and Greek letters to diacritics and graphic signs such as Zodiac and alchemical symbols. Catchwords (preview fragments) of one to three or four characters are written at the bottom of left-hand pages.