19,997 machine learning datasets
TempQA-WD is a benchmark dataset for temporal reasoning, designed to encourage research that extends present approaches to a more challenging set of complex reasoning tasks. Specifically, the benchmark is a temporal question answering dataset with the following advantages: (a) it is based on Wikidata, which is the most frequently curated, openly available knowledge base, (b) it includes intermediate SPARQL queries to facilitate the evaluation of semantic-parsing-based approaches for KBQA, and (c) it generalizes to multiple knowledge bases: Freebase and Wikidata.
Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR) is a dataset for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, the dataset is augmented using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one.
RuMedBench is a benchmark dataset for Russian medical language understanding.
The IKEA Object State Dataset is a new dataset that contains IKEA furniture 3D models, RGBD video of the assembly process, and the 6DoF poses of furniture parts with their bounding boxes.
KazNERD is a dataset for Kazakh named entity recognition. The dataset was built because there is a clear need for publicly available annotated corpora in Kazakh, as well as for annotation guidelines containing straightforward but rigorous rules and examples. The annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes.
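The IOB2 scheme mentioned above marks the first token of each entity with a B- tag, the following tokens of the same entity with I- tags, and every other token with O. A minimal sketch of decoding such tags into entity spans is given below. The function name `iob2_to_spans`, the English example tokens, and the labels PER and LOC are assumptions made for illustration only; they are not taken from KazNERD or its 25 entity classes.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert a sequence of IOB2 tags into (label, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an open span
    with the same label is ignored, which is a strict reading of IOB2.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            # In IOB2 a B- tag always opens a new span.
            start, label = i, tag[2:]
    if label is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence and labels, not drawn from the dataset.
tokens = ["Jane", "Doe", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for label, start, end in iob2_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

The spans are returned as token offsets rather than strings, so the same function can be reused for counting annotations or for span-level evaluation.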
Prepositions Annotated with Supersense Tags in Reddit International English (PASTRIE) is a new corpus containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish.
The Korean tabular dataset is a collection of 1.4M tables with corresponding descriptions for unsupervised pre-training of language models. The Korean table question answering corpus consists of 70k question-answer pairs created by crowd-sourced workers.
We provide a dataset called MMAC Captions for sensor-augmented egocentric-video captioning. The dataset contains 5,002 activity descriptions and was created by extending the CMU-MMAC dataset. A number of example activity descriptions can be found on the homepage.
This corpus was constructed by collecting 10,008 reviews from various domains, including sports, food, software, politics, and entertainment. Human annotators manually tagged the reviews into positive (n = 3662), negative (n = 2619), and neutral (n = 3727) categories.
The Tsinghua-Daimler Cyclist Benchmark provides a benchmark dataset for cyclist detection. Bounding-box-based labels are provided for the classes "pedestrian", "cyclist", "motorcyclist", "tricyclist", "wheelchairuser", and "mopedrider".
The size of the data set is about 1 GB. The data set consists of 900 image sequences of 9 gesture classes, which are defined by 3 primitive hand shapes and 3 primitive motions. The target task for this data set is therefore to classify different shapes and different motions at the same time.
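The 9 gesture classes above are the combinations of 3 hand shapes with 3 motions, so a single class index can be split into a shape index and a motion index. The sketch below shows one way to do that. The placeholder names `shape_0` and `motion_0`, the helper names, and the shape-major index ordering are all assumptions for illustration; the data set's own naming and class numbering may differ.

```python
from itertools import product

# Placeholder names: the actual shape and motion names are not given here.
SHAPES = ["shape_0", "shape_1", "shape_2"]
MOTIONS = ["motion_0", "motion_1", "motion_2"]

# Every (shape, motion) pair is one gesture class: 3 x 3 = 9 classes.
GESTURE_CLASSES = list(product(SHAPES, MOTIONS))


def encode(shape_idx: int, motion_idx: int) -> int:
    """Map a (shape, motion) index pair to a single class index (assumed ordering)."""
    return shape_idx * len(MOTIONS) + motion_idx


def decode(class_idx: int) -> tuple:
    """Recover the (shape, motion) index pair from a class index."""
    return divmod(class_idx, len(MOTIONS))


print(len(GESTURE_CLASSES))
print(GESTURE_CLASSES[encode(1, 2)])
```

Factoring the label this way lets a classifier be scored separately on shape and on motion, in addition to the joint 9-way accuracy.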
The peer-reviewed paper on the AWARE dataset was published at ASEW 2021 and can be accessed at http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using the AWARE dataset.
PPG-DaLiA is a publicly available dataset for PPG-based heart rate estimation. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects while performing a wide range of activities under close to real-life conditions. The included ECG data provides heart rate ground truth. The included PPG- and 3D-accelerometer data can be used for heart rate estimation, while compensating for motion artefacts.
The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions to Norwegian Bokmål and Norwegian Nynorsk. All transcriptions are done manually by trained linguists or philologists, and the manual transcriptions are subsequently proofread to ensure consistency and accuracy. Entire days of Parliamentary meetings are transcribed in the dataset.
This database offers iris images (with and without contact lenses) of the same eyes, captured shortly one after another with illumination coming from two different locations. In total, 5,796 iris images were acquired with the LG IrisAccess 4000 sensor from 119 subjects. This set is divided into four subsets used in the experiments: (a) 1,800 images of irises wearing regular (with dot-like pattern) textured contact lenses, as shown in Fig. 6a in the WACV 2019 paper; (b) 864 images of irises wearing irregular (without dot-like pattern) textured contact lenses, as shown in Fig. 6b in the WACV 2019 paper; (c) 1,728 images of irises wearing clear contact lenses (without any visible pattern); and (d) 1,404 images of authentic irises without any contact lenses.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
This dataset contains 54,987 UI screenshots and the associated metadata from 7,748 Android applications belonging to 25 application categories.
The CENTER-TBI database contains prospectively collected data on more than 4,500 patients with TBI in Europe. The Registry and Acute Care data were collected over a three-year period (2015-2017) in 65 centers in Europe. For all patients, outcome data were collected up to 2 years after injury.
This dataset comprises 1,344 expert-annotated images of muscle-tendon junctions recorded with 3 ultrasound imaging systems (Aixplorer V6, Esaote MyLab60, Telemed ArtUs), on 2 muscles (Lateral Gastrocnemius, Medial Gastrocnemius), and during 2 movements (isometric maximum voluntary contractions, passive torque movements).
This dataset contains the protein-ligand complexes of PDBBind v2020, preprocessed as described in the paper "EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction", with associated code at https://github.com/HannesStark/EquiBind