19,997 machine learning datasets
19,997 dataset results
CommonCrawl News is a dataset containing news articles from news sites all over the world. The dataset is available in form of Web ARChive (WARC) files that are released on a daily basis.
A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying images from the GI tract into 23 distinct classes. The second task focuses on efficiant classification measured by the amount of time spent processing each image. The last task relates to automatcially segmenting polyps.
This collection contains data and code associated with the IPCAI/IJCARS 2020 paper “Automatic Annotation of Hip Anatomy in Fluoroscopy for Robust and Efficient 2D/3D Registration.” The data hosted here consists of annotated datasets of actual hip fluoroscopy, CT and derived data from six lower torso cadaveric specimens. Documentation and examples for using the dataset and Python code for training and testing the proposed models are also included. Higher-level information, including clinical motivations, prior works, algorithmic details, applications to 2D/3D registration, and experimental details, may be found in the companion paper which is available at https://arxiv.org/abs/1911.07042 or https://doi.org/10.1007/s11548-020-02162-7. We hope that this code and data will be useful in the development of new computer-assisted capabilities that leverage fluoroscopy.
Introduction The 2016 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify heart sound recordings collected from a variety of clinical or nonclinical (such as in-home visits) environments. The aim is to identify, from a single short recording (10-60s) from a single precordial location, whether the subject of the recording should be referred on for an expert diagnosis.
Fine-Grained Thai Food Image Classification Datasets
The training and validation data are subsets of the training split of the MS COCO dataset (2017 release, bounding boxes only). The test set is taken from the validation split of the MS COCO dataset.
The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken from the validation split of the Cityscapes dataset.
TeachMyAgent (TA) is a benchmark for Automatic Curriculum Learning (ACL) algorithms leveraging procedural task generation. It includes 1) challenge-specific unit-tests using variants of a procedural Box2D bipedal walker environment, and 2) a new procedural Parkour environment combining most ACL challenges, making it ideal for global performance assessment.
The Hochschule Darmstadt (HDA) facial tattoo and paintings database contains 500 pairs of facial images of individuals with and without facial tattoos or paintings. The database was collected from multiple online sources.
Danish Fungi 2020 (DF20) is a novel fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal Atlas, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing unbiased comparison of models fine-tuned from publicly available ImageNet checkpoints.
The PESMOD (PExels Small Moving Object Detection) dataset consists of high resolution aerial images in which moving objects are labelled manually. It was created from videos selected from the Pexels website. The aim of this dataset is to provide a different and challenging dataset for moving object detection methods evaluation. Each moving object is labelled for each frame with PASCAL VOC format in a XML file. The dataset consists of 8 different video sequences.
Alsat-2B is a remote sensing dataset of low and high spatial resolution images (10m and 2.5m respectively) for the single-image super-resolution task. The high-resolution images are obtained through pan-sharpening. The dataset has been created from 13 images captured by the Alsat-2B Earth observation satellite, where the image cover 13 different cities.
The softwarised network data zoo (SNDZoo) is an open collection of software networking data sets aiming to streamline and ease machine learning research in the software networking domain. Most of the published data sets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us or third party solutions like Gym. The collection of the presented data sets follows the general VNF benchmarking methodology described in.
StyleKQC is a style-variant paraphrase corpus for korean questions and commands. It was built with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, the corpus was expanded to formal and informal sentences by human rewriting and transferring.
LaboroTVSpeech is a large-scale Japanese speech corpus built from broadcast TV recordings and their subtitles. It contains over 2,000 hours of speech.
INSTRE is a benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). INSTRE has the following major properties: (1) balanced data scale, (2) more diverse intraclass instance variations, (3) cluttered and less contextual backgrounds, (4) object localization annotation for each image, (5) well-manipulated double-labelled images for measuring multiple object (within one image) case.
BS-RSCD is a dataset for rolling shutter correction and deblurring (RSCD). The dataset includes both ego-motion and object-motion in dynamic scenes. Real distorted and blurry videos with corresponding ground truth are recorded simultaneously via a beam-splitter-based acquisition system.
BiasCorp is a dataset for racism detection containing 139,090 comments and news segment from three specific sources - Fox News, BreitbartNews and YouTube.
A multilingual etymological database extracted from the Wiktionary (described in Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0)