Datasets

19,997 machine learning datasets


CC-News (CommonCrawl News dataset)

CommonCrawl News is a dataset containing news articles from news sites all over the world. The dataset is available in the form of Web ARChive (WARC) files that are released on a daily basis.

2 papers | 0 benchmarks | Texts

Endotect Polyp Segmentation Challenge Dataset

A challenge that consists of three tasks, each targeting a different requirement for in-clinic use. The first task involves classifying images from the GI tract into 23 distinct classes. The second task focuses on efficient classification, measured by the amount of time spent processing each image. The last task relates to automatically segmenting polyps.

2 papers | 3 benchmarks | Biomedical, Images, Medical

DeepFluoroLabeling-IPCAI2020

This collection contains data and code associated with the IPCAI/IJCARS 2020 paper “Automatic Annotation of Hip Anatomy in Fluoroscopy for Robust and Efficient 2D/3D Registration.” The data hosted here consists of annotated datasets of actual hip fluoroscopy, CT and derived data from six lower torso cadaveric specimens. Documentation and examples for using the dataset and Python code for training and testing the proposed models are also included. Higher-level information, including clinical motivations, prior works, algorithmic details, applications to 2D/3D registration, and experimental details, may be found in the companion paper which is available at https://arxiv.org/abs/1911.07042 or https://doi.org/10.1007/s11548-020-02162-7. We hope that this code and data will be useful in the development of new computer-assisted capabilities that leverage fluoroscopy.

2 papers | 0 benchmarks | 3D, Biomedical, Images, Medical

PhysioNet Challenge 2016

The 2016 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify heart sound recordings collected from a variety of clinical or nonclinical (such as in-home visits) environments. The aim is to identify, from a single short recording (10-60 s) from a single precordial location, whether the subject of the recording should be referred on for an expert diagnosis.

2 papers | 0 benchmarks | Audio, Medical

THFOOD-50 (Thai Food 50 Image Classification)

Fine-Grained Thai Food Image Classification Datasets

2 papers | 0 benchmarks

COCO Object Detection VIPriors subset

The training and validation data are subsets of the training split of the MS COCO dataset (2017 release, bounding boxes only). The test set is taken from the validation split of the MS COCO dataset.

2 papers | 0 benchmarks | Images

Cityscapes VIPriors subset

The training and validation data are subsets of the training split of the Cityscapes dataset. The test set is taken from the validation split of the Cityscapes dataset.

2 papers | 4 benchmarks | Images

TeachMyAgent

TeachMyAgent (TA) is a benchmark for Automatic Curriculum Learning (ACL) algorithms leveraging procedural task generation. It includes 1) challenge-specific unit tests using variants of a procedural Box2D bipedal walker environment, and 2) a new procedural Parkour environment combining most ACL challenges, making it ideal for global performance assessment.

2 papers | 0 benchmarks | Environment

HDA Facial Tattoo and Painting Database

The Hochschule Darmstadt (HDA) facial tattoo and paintings database contains 500 pairs of facial images of individuals with and without facial tattoos or paintings. The database was collected from multiple online sources.

2 papers | 0 benchmarks | Images

DF20 - Mini (Danish Fungi 2020 - Mini)

Danish Fungi 2020 (DF20) is a novel fine-grained dataset and benchmark. The dataset, constructed from observations submitted to the Danish Fungal Atlas, is unique in its taxonomy-accurate class labels, small number of errors, highly unbalanced long-tailed class distribution, rich observation metadata, and well-defined class hierarchy. DF20 has zero overlap with ImageNet, allowing unbiased comparison of models fine-tuned from publicly available ImageNet checkpoints.

2 papers | 3 benchmarks | Images

PESMOD (PExels Small Moving Object Detection)

The PESMOD (PExels Small Moving Object Detection) dataset consists of high-resolution aerial images in which moving objects are labelled manually. It was created from videos selected from the Pexels website. The aim of this dataset is to provide a different and challenging dataset for evaluating moving object detection methods. Each moving object is labelled for each frame in PASCAL VOC format in an XML file. The dataset consists of 8 different video sequences.

2 papers | 0 benchmarks | Videos

Alsat-2B

Alsat-2B is a remote sensing dataset of low and high spatial resolution images (10 m and 2.5 m respectively) for the single-image super-resolution task. The high-resolution images are obtained through pan-sharpening. The dataset has been created from 13 images captured by the Alsat-2B Earth observation satellite, where the images cover 13 different cities.

2 papers | 0 benchmarks | Images

SNDZoo (The Softwarised Network Data Zoo)

The softwarised network data zoo (SNDZoo) is an open collection of software networking data sets aiming to streamline and ease machine learning research in the software networking domain. Most of the published data sets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us, or third-party solutions like Gym. The collection of the presented data sets follows the general VNF benchmarking methodology described in.

2 papers | 0 benchmarks | Tabular, Time series, Tracking

StyleKQC

StyleKQC is a style-variant paraphrase corpus for Korean questions and commands. It was built with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Utilizing manually generated natural language queries on six daily topics, the corpus was expanded to formal and informal sentences by human rewriting and transferring.

2 papers | 0 benchmarks | Texts

LaboroTVSpeech

LaboroTVSpeech is a large-scale Japanese speech corpus built from broadcast TV recordings and their subtitles. It contains over 2,000 hours of speech.

2 papers | 0 benchmarks | Speech

INSTRE

INSTRE is a benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). INSTRE has the following major properties: (1) balanced data scale, (2) more diverse intraclass instance variations, (3) cluttered and less contextual backgrounds, (4) object localization annotation for each image, (5) well-manipulated double-labelled images for measuring the multiple-object (within one image) case.

2 papers | 1 benchmark | Images

THYME-2016

2 papers | 1 benchmark | Medical, Texts

BS-RSCD

BS-RSCD is a dataset for rolling shutter correction and deblurring (RSCD). The dataset includes both ego-motion and object-motion in dynamic scenes. Real distorted and blurry videos with corresponding ground truth are recorded simultaneously via a beam-splitter-based acquisition system.

2 papers | 0 benchmarks | Images

BiasCorp

BiasCorp is a dataset for racism detection containing 139,090 comments and news segments from three sources: Fox News, BreitbartNews, and YouTube.

2 papers | 0 benchmarks | Texts

EtymDB 2.0

A multilingual etymological database extracted from Wiktionary (described in “Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0”).

2 papers | 0 benchmarks | Texts
