Datasets

19,997 machine learning datasets

TempQA-WD

TempQA-WD is a benchmark dataset for temporal reasoning, designed to encourage research that extends present approaches to a more challenging set of complex reasoning tasks. Specifically, the benchmark is a temporal question answering dataset with the following advantages: (a) it is based on Wikidata, which is the most frequently curated, openly available knowledge base; (b) it includes intermediate SPARQL queries to facilitate the evaluation of semantic-parsing-based approaches for KBQA; and (c) it generalizes to multiple knowledge bases: Freebase and Wikidata.

2 papers | 1 benchmark | Texts

CI-AVSR

Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR) is a dataset for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, the dataset is augmented using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one.

2 papers | 0 benchmarks | Speech

RuMedBench

RuMedBench is a benchmark dataset for Russian medical language understanding.

2 papers | 0 benchmarks | Texts

IKEA Object State Dataset

IKEA Object State Dataset is a new dataset that contains IKEA furniture 3D models, RGBD video of the assembly process, and the 6DoF poses of furniture parts together with their bounding boxes.

2 papers | 0 benchmarks | 3D

KazNERD

KazNERD is a dataset for Kazakh named entity recognition. The dataset was built because there is a clear need for publicly available annotated corpora in Kazakh, as well as for annotation guidelines containing straightforward but rigorous rules and examples. The dataset annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes.

2 papers | 0 benchmarks | Texts
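For readers unfamiliar with the IOB2 scheme mentioned in the KazNERD description, the following is a minimal sketch of how IOB2 tags are typically decoded into entity spans. It is not taken from the KazNERD release: the function name `iob2_to_spans` and the labels `PERSON` and `GPE` are hypothetical and used only for illustration.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB2 tags into (label, start, end) spans; end is exclusive.

    In IOB2, every entity starts with "B-<label>", continues with
    "I-<label>", and tokens outside any entity are tagged "O".
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray "I-" tag with no open entity is ignored here (an assumption;
        # other decoders may choose to start a new entity instead).
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "B-GPE", "I-GPE"]
example_spans = iob2_to_spans(example_tags)
```

Because each entity must open with a `B-` tag, two adjacent entities of the same class stay distinguishable, which is the main difference from the older IOB1 convention.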

PASTRIE (Prepositions Annotated with Supersense Tags in Reddit International English)

Prepositions Annotated with Supersense Tags in Reddit International English (PASTRIE) is a new corpus containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish.

2 papers | 0 benchmarks

Korean Table Question Answering

The Korean tabular dataset is a collection of 1.4M tables with corresponding descriptions for unsupervised pre-training of language models. The Korean table question answering corpus consists of 70k question-answer pairs created by crowd-sourced workers.

2 papers | 0 benchmarks

MMAC Captions

We provide a dataset called MMAC Captions for sensor-augmented egocentric-video captioning. The dataset contains 5,002 activity descriptions and was created by extending the CMU-MMAC dataset. A number of activity description examples can be found on the homepage.

2 papers | 0 benchmarks | Time series, Videos

Urdu Online Reviews

This corpus was constructed by collecting 10,008 reviews from various domains, including sports, food, software, politics, and entertainment. Human annotators manually tagged the reviews into positive (n = 3,662), negative (n = 2,619), and neutral (n = 3,727) categories.

2 papers | 1 benchmark

Tsinghua-Daimler Cyclist Benchmark

The Tsinghua-Daimler Cyclist Benchmark provides a benchmark dataset for cyclist detection. Bounding-box-based labels are provided for the classes "pedestrian", "cyclist", "motorcyclist", "tricyclist", "wheelchairuser", and "mopedrider".

2 papers | 0 benchmarks

CamGes (Cambridge Hand Gesture Dataset)

The size of the data set is about 1 GB. The data set consists of 900 image sequences of 9 gesture classes, which are defined by 3 primitive hand shapes and 3 primitive motions. Therefore, the target task for this data set is to classify different shapes and different motions at the same time.

2 papers | 0 benchmarks | Images
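The CamGes class structure (9 classes formed from 3 hand shapes and 3 motions) can be sketched as a Cartesian product. The shape and motion names below are placeholders, and the shape-major ordering of class indices is an assumption for illustration, not something stated in the description.

```python
from itertools import product
from typing import Tuple

# Placeholder names (assumptions); the description only gives the counts.
SHAPES = ["shape_0", "shape_1", "shape_2"]
MOTIONS = ["motion_0", "motion_1", "motion_2"]

# Every gesture class is one (shape, motion) combination: 3 x 3 = 9 classes.
CLASSES = list(product(SHAPES, MOTIONS))


def split_label(class_id: int) -> Tuple[str, str]:
    """Split a joint class index into its shape and motion components.

    Assumes a shape-major ordering of class indices (hypothetical).
    """
    shape_id, motion_id = divmod(class_id, len(MOTIONS))
    return SHAPES[shape_id], MOTIONS[motion_id]


# If the 900 sequences were spread evenly over the classes (an assumption),
# each class would hold 900 // 9 sequences.
sequences_per_class = 900 // len(CLASSES)
```

Treating the label as a pair makes it possible to report shape accuracy and motion accuracy separately, in addition to accuracy on the joint 9-way task.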

AWARE (AWARE: Aspect-Based Sentiment Analysis Dataset of Apps Reviews for Requirements Elicitation)

The peer-reviewed paper on the AWARE dataset was published at ASEW 2021 and can be accessed at http://doi.org/10.1109/ASEW52652.2021.00049. Kindly cite this paper when using the AWARE dataset.

2 papers | 3 benchmarks | Texts

PPG Dalia (PPG Field Study Dataset)

PPG-DaLiA is a publicly available dataset for PPG-based heart rate estimation. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects while performing a wide range of activities under close to real-life conditions. The included ECG data provides heart rate ground truth. The included PPG- and 3D-accelerometer data can be used for heart rate estimation, while compensating for motion artefacts.

2 papers | 0 benchmarks | Time series

NPSC (Norwegian Parliamentary Speech Corpus)

The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions to Norwegian Bokmål and Norwegian Nynorsk. All transcriptions are done manually by trained linguists or philologists, and the manual transcriptions are subsequently proofread to ensure consistency and accuracy. Entire days of Parliamentary meetings are transcribed in the dataset.

2 papers | 0 benchmarks | Speech, Texts

NDPSID - WACV 2019 (Notre Dame Photometric Stereo Iris Dataset)

This database offers iris images (with and without contact lenses) of the same eyes, captured shortly one after another with illumination coming from two different locations. In total, 5,796 iris images were acquired by the LG IrisAccess 4000 sensor from 119 subjects. This set is divided into four subsets used in the experiments: (a) 1,800 images of irises wearing regular (with dot-like pattern) textured contact lenses, as shown in Fig. 6a in the WACV 2019 paper; (b) 864 images of irises wearing irregular (without dot-like pattern) textured contact lenses, as shown in Fig. 6b in the WACV 2019 paper; (c) 1,728 images of irises wearing clear contact lenses (without any visible pattern); and (d) 1,404 images of authentic irises without any contact lenses.

2 papers | 0 benchmarks

ChEMBL

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

2 papers | 0 benchmarks | Graphs

UI5k (Mobile App User Interface Dataset)

This dataset contains 54,987 UI screenshots and the metadata from 7,748 Android applications belonging to 25 application categories.

2 papers | 0 benchmarks | Images

CENTER-TBI (Collaborative European NeuroTrauma Effectiveness Research in TBI)

The CENTER-TBI database contains prospectively collected data of more than 4,500 patients with TBI in Europe. The Registry and Acute Care data were collected over a three-year period (2015-2017) in 65 centers in Europe. For all patients, outcome data were collected up to 2 years after injury.

2 papers | 0 benchmarks | Medical

deepMTJ_IEEEtbme

This dataset comprises 1,344 expert-annotated images of muscle-tendon junctions recorded with 3 ultrasound imaging systems (Aixplorer V6, Esaote MyLab60, Telemed ArtUs), on 2 muscles (Lateral Gastrocnemius, Medial Gastrocnemius), and during 2 movements (isometric maximum voluntary contractions, passive torque movements).

2 papers | 0 benchmarks | Images

EquiBind data (EquiBind preprocessing of PDBBind v2020)

This dataset contains the protein-ligand complexes of PDBBind v2020, preprocessed as described in the paper "EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction". The associated code is available at https://github.com/HannesStark/EquiBind

2 papers | 0 benchmarks
