Datasets

19,997 machine learning datasets

19,997 dataset results

USPTO-50k

Subset and preprocessed version of Chemical reactions from US patents (1976-Sep2016) by Daniel Lowe. It includes 50K randomly selected reactions that was later classified into 10 reaction classes by Nadine Schneider et al.

60 papers6 benchmarks

Brightkite

Brightkite was once a location-based social networking service provider where users shared their locations by checking-in. The friendship network was collected using their public API, and consists of 58,228 nodes and 214,078 edges. The network is originally directed but the collectors have constructed a network with undirected edges when there is a friendship in both ways. The collectors have also collected a total of 4,491,143 checkins of these users over the period of Apr. 2008 - Oct. 2010.

60 papers0 benchmarks

Penn94

Node classification on Penn94

60 papers2 benchmarksGraphs

SummScreen

SummScreen is a dataset for abstractive screenplay summarization. It consists of pairs of TV series transcripts and human-written recaps. This dataset provides a challenging testbed for abstractive summarization for several reasons: - Plot details are often expressed indirectly in character dialogues and may be scattered across the entirety of the transcript. - These details must be found and integrated to form the succinct plot descriptions in the recaps. - TV scripts contain content that does not directly pertain to the central plot but rather serves to develop characters or provide comic relief. This information is rarely contained in recaps.

60 papers4 benchmarks

CHASE_DB1

CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels which are collected from both left and right eyes of 14 school children. Each image is annotated by two independent human experts.

59 papers18 benchmarksImages, Medical

Set11

Set11 is a dataset of 11 grayscale images. It is a dataset used for image reconstruction and image compression.

59 papers0 benchmarksImages

SParC (Semantic Parsing in Context)

SParC is a large-scale dataset for complex, cross-domain, and context-dependent (multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces for relational databases).

59 papers5 benchmarksTexts

CELEX

CELEX database comprises three different searchable lexical databases, Dutch, English and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.

59 papers0 benchmarksTexts

Middlebury 2014

The Middlebury 2014 dataset contains a set of 23 high resolution stereo pairs for which known camera calibration parameters and ground truth disparity maps obtained with a structured light scanner are available. The images in the Middlebury dataset all show static indoor scenes with varying difficulties including repetitive structures, occlusions, wiry objects as well as untextured areas.

59 papers9 benchmarksImages, Stereo

MedMNIST

A collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28x28 images, which requires no background knowledge.

59 papers0 benchmarks

WOS (Web of Science Dataset)

Web of Science (WOS) is a document classification dataset that contains 46,985 documents with 134 categories which include 7 parents categories.

59 papers2 benchmarksTexts

Oxford-IIIT Pets

The Oxford-IIIT Pet Dataset is a 37-category pet dataset with roughly 200 images for each class. The images have large variations in scale, pose, and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel-level trimap segmentation.

59 papers11 benchmarks

CASIA-MFSD

CASIA-MFSD is a dataset for face anti-spoofing. It contains 50 subjects, and 12 videos for each subject under different resolutions and light conditions. Three different spoof attacks are designed: replay, warp print and cut print attacks. The database contains 600 video recordings, in which 240 videos of 20 subjects are used for training and 360 videos of 30 subjects for testing.

58 papers16 benchmarksImages, Videos

EmoryNLP

EmoryNLP comprises 97 episodes, 897 scenes, and 12,606 utterances, where each utterance is annotated with one of the seven emotions borrowed from the six primary emotions in the Willcox (1982)’s feeling wheel, sad, mad, scared, powerful, peaceful, joyful, and a default emotion of neutral.

58 papers2 benchmarksAudio, Texts

CANARD (A Dataset for Question-in-Context Rewriting)

CANARD is a dataset for question-in-context rewriting that consists of questions each given in a dialog context together with a context-independent rewriting of the question. The context of each question is the dialog utterences that precede the question. CANARD can be used to evaluate question rewriting models that handle important linguistic phenomena such as coreference and ellipsis resolution.

58 papers1 benchmarksTexts

ETH (ETH Pedestrian)

ETH is a dataset for pedestrian detection. The testing set contains 1,804 images in three video clips. The dataset is captured from a stereo rig mounted on car, with a resolution of 640 x 480 (bayered), and a framerate of 13--14 FPS.

58 papers1 benchmarksImages, Videos

MultiTHUMOS

The MultiTHUMOS dataset contains dense, multilabel, frame-level action annotations for 30 hours across 400 videos in the THUMOS'14 action detection dataset. It consists of 38,690 annotations of 65 action classes, with an average of 1.5 labels per frame and 10.5 action classes per video.

58 papers41 benchmarksVideos

ExDark (Exclusively Dark Image Dataset)

The Exclusively Dark (ExDARK) dataset is a collection of 7,363 low-light images from very low-light environments to twilight (i.e 10 different conditions) with 12 object classes (similar to PASCAL VOC) annotated on both image class level and local object bounding boxes.

58 papers5 benchmarksImages

LCSTS

LCSTS is a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public. This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text. The authors also manually tagged the relevance of 10,666 short summaries with their corresponding short texts 10,666 short summaries with their corresponding short texts.

58 papers2 benchmarksTexts

SoccerNet-v2

A novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production.

58 papers13 benchmarksVideos

PreviousPage 50 of 1000Next