Datasets

19,997 machine learning datasets


ECG5000

The original data for "ECG5000" is a 20-hour ECG recording downloaded from PhysioNet: record "chf07" of the BIDMC Congestive Heart Failure Database (chfdb). It was originally published in "Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23)". The data was pre-processed in two steps: (1) extract each heartbeat, (2) make each heartbeat equal length using interpolation. This dataset was originally used in the paper "A general framework for never-ending learning from time series streams", DAMI 29(6). After that, 5,000 heartbeats were randomly selected. The patient has severe congestive heart failure, and the class values were obtained by automated annotation.
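The two pre-processing steps can be illustrated with a minimal sketch. It assumes linear interpolation (the description does not say which interpolation was used), beat boundaries that are already known, and a hypothetical target length of 140 samples; the function names are illustrative, not part of any published tooling.

```python
import math

def resample_beat(beat, target_len=140):
    """Linearly interpolate one heartbeat onto target_len evenly spaced points.

    target_len=140 is an assumed default, not a value stated in the description.
    """
    n = len(beat)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two samples on each side")
    out = []
    for i in range(target_len):
        # Position of output sample i on the input's index axis.
        pos = i * (n - 1) / (target_len - 1)
        lo = min(int(pos), n - 2)
        frac = pos - lo
        out.append(beat[lo] * (1.0 - frac) + beat[lo + 1] * frac)
    return out

def split_beats(signal, boundaries):
    """Step 1: cut a 1-D signal into beats at given sample indices (assumed known)."""
    return [signal[a:b] for a, b in zip(boundaries[:-1], boundaries[1:])]

# Toy stand-in for an ECG recording; the beat boundaries are hypothetical.
signal = [math.sin(2 * math.pi * t / 110) for t in range(330)]
boundaries = [0, 100, 220, 330]

# Step 2: bring every beat to the same length.
beats = [resample_beat(b) for b in split_beats(signal, boundaries)]
print(len(beats), {len(b) for b in beats})  # 3 {140}
```

Beats of 100, 120, and 110 samples all come out with the same length, which is what allows them to be stacked into a fixed-width table for classification.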

6 papers | 4 benchmarks

Food-101N

The Food-101N dataset was introduced in "CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise" (CVPR'18). It is an image dataset containing 310,009 images of food recipes classified into 101 classes (categories). Food-101N and the Food-101 dataset share the same 101 classes, but Food-101N has many more images and is noisier.

6 papers | 1 benchmark | Images

LCQMC (Large-scale Chinese Question Matching Corpus)

LCQMC is a large-scale Chinese question matching corpus. It is more general than a paraphrase corpus because it focuses on intent matching rather than paraphrase. The corpus contains 260,068 manually annotated question pairs.

6 papers | 0 benchmarks | Texts

Im2GPS

A dataset of over 6 million GPS-tagged images from Flickr. The training set is private. The test set consists of 237 images.

6 papers | 16 benchmarks | Images

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in "The CHEMDNER corpus of chemicals and drugs and its annotation principles".

6 papers | 1 benchmark | Texts

SherLIiC

SherLIiC is a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types.

6 papers | 0 benchmarks | Texts

DebateSum

DebateSum consists of 187,328 debate documents, arguments (which can also be thought of as abstractive summaries or queries), word-level extractive summaries, citations, and associated metadata organized by topic-year. The data is ready for analysis by NLP systems.

6 papers | 2 benchmarks | Texts

AM-2k (Animal Matting 2,000 Dataset)

AM-2k (Animal Matting 2,000 Dataset) consists of 2,000 high-resolution images collected and carefully selected from websites with open licenses. AM-2k contains 20 categories of animals: alpaca, antelope, bear, camel, cat, cattle, deer, dog, elephant, giraffe, horse, kangaroo, leopard, lion, monkey, rabbit, rhinoceros, sheep, tiger, and zebra, each with 100 real-world images of various appearances and diverse backgrounds.

6 papers | 3 benchmarks

CCMixter

CCMixter is a singing voice separation dataset consisting of 50 full-length stereo tracks from ccMixter featuring many different musical genres. For each song there are three WAV files available: the background music, the voice signal, and their sum.

6 papers | 0 benchmarks | Audio

AtariARI (Atari Annotated RAM Interface)

The AtariARI (Atari Annotated RAM Interface) is an environment for representation learning. The Atari Arcade Learning Environment (ALE) does not explicitly expose any ground truth state information. However, ALE does expose the RAM state (128 bytes per timestep), which the game programmer uses to store important state information such as the location of sprites, the state of the clock, or the current room the agent is in. To extract these variables, the dataset creators consulted commented disassemblies (or source code) of Atari 2600 games, which were made available by Engelhardt and Jentzsch and CPUWIZ. The dataset creators were able to find and verify important state variables for a total of 22 games. Once this information was acquired, combining it with the ALE interface produced a wrapper that can automatically output a state label for every example frame generated from the game. The dataset creators make this available with an easy-to-use gym wrapper, which returns this infor
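The idea of reading state labels out of the RAM snapshot can be sketched as follows. This is not the AtariARI wrapper itself: the variable names and RAM indices are hypothetical placeholders chosen for illustration, and no gym or ALE calls are made.

```python
# Hypothetical mapping from state-variable name to RAM byte index.
# These indices are illustrative only and are not taken from AtariARI.
RAM_LABELS = {"player_x": 10, "player_y": 11, "room_number": 42}

def label_ram(ram, ram_labels=RAM_LABELS):
    """Return a dict of state labels read from a 128-byte RAM snapshot."""
    if len(ram) != 128:
        raise ValueError("expected 128 bytes of RAM")
    return {name: ram[idx] for name, idx in ram_labels.items()}

# Fake RAM snapshot standing in for what ALE would expose at one timestep.
ram = bytearray(128)
ram[10], ram[11], ram[42] = 77, 5, 3
print(label_ram(ram))  # {'player_x': 77, 'player_y': 5, 'room_number': 3}
```

A wrapper built this way would call a function like `label_ram` after every environment step and attach the result to the frame it returns.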

6 papers | 0 benchmarks | Environment

SemEval-2014 Task-10

SemEval 2014 is a collection of datasets used for the Semantic Evaluation (SemEval) workshop, an annual event that focuses on the evaluation and comparison of systems that can analyze diverse semantic phenomena in text. The datasets from SemEval 2014 are used for various tasks, including but not limited to:

6 papers | 0 benchmarks | Texts

MEDIA

The MEDIA French corpus is dedicated to semantic extraction from speech in the context of human/machine dialogues. The corpus has manual transcriptions and conceptual annotations of dialogues from 250 speakers. It is split into the following three parts: (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences), and (3) the test set (200 dialogues, 3K sentences).

6 papers | 0 benchmarks | Audio, Texts

OMICS (Open Mind Indoor Common Sense)

OMICS is an extensive collection of knowledge for indoor service robots gathered from internet users. Currently, it contains 48 tables capturing different sorts of knowledge. Each tuple of the Help table maps a user desire to a task that may meet the desire (e.g., ⟨ “feel thirsty”, “by offering drink” ⟩). Each tuple of the Tasks/Steps table decomposes a task into several steps (e.g., ⟨ “serve a drink”, 0. “get a glass”, 1. “get a bottle”, 2. “fill glass from bottle”, 3. “give glass to person” ⟩). Given this, OMICS offers useful knowledge about the hierarchy of naturalistic instructions, where a high-level user request (e.g., “serve a drink”) can be reduced to lower-level tasks (e.g., “get a glass”, ⋯). Another feature of OMICS is that the elements of any tuple in an OMICS table are semantically related according to a predefined template. This facilitates the semantic interpretation of OMICS tuples.
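One way to picture the two tables is as typed tuples. The sketch below reuses the example rows from the description; the class names, field names, and the `decompose` helper are hypothetical and do not reflect the actual OMICS schema.

```python
from typing import NamedTuple, Tuple

class HelpTuple(NamedTuple):
    """One row of the Help table: a user desire and a task that may meet it."""
    desire: str
    task: str

class TaskSteps(NamedTuple):
    """One row of the Tasks/Steps table: a task and its ordered steps."""
    task: str
    steps: Tuple[str, ...]

help_table = [HelpTuple("feel thirsty", "by offering drink")]
steps_table = [
    TaskSteps(
        "serve a drink",
        ("get a glass", "get a bottle", "fill glass from bottle", "give glass to person"),
    )
]

def decompose(task, table):
    """Reduce a high-level task to its lower-level steps, or [] if unknown."""
    for row in table:
        if row.task == task:
            return list(row.steps)
    return []

for number, step in enumerate(decompose("serve a drink", steps_table)):
    print(number, step)
```

Because every row follows a fixed template, a lookup like `decompose` can treat the first element as the task and the rest as ordered steps without parsing free text.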

6 papers | 0 benchmarks | Texts

TRECVID

TRECVID is a yearly set of competitions centered on video retrieval and indexing, hosting a variety of video data sets.

6 papers | 1 benchmark | Images, Videos

ISIC 2018 Task 2

The ISIC 2018 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The Task 2 dataset is for the challenge on lesion attribute detection and includes 2594 images. The task is to detect the following dermoscopic attributes: pigment network, negative network, streaks, milia-like cysts, and globules (including dots).

6 papers | 0 benchmarks | Images, Medical

XQA

XQA is a dataset of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.

6 papers | 0 benchmarks | Texts

TalkSumm

The TalkSumm dataset contains 1705 automatically-generated summaries of scientific papers from ACL, NAACL, EMNLP, SIGDIAL (2015-2018), and ICML (2017-2018).

6 papers | 0 benchmarks | Texts

KnowledgeNet

KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web. KnowledgeNet provides text exhaustively annotated with facts, enabling end-to-end evaluation of knowledge base population systems as a whole, unlike previous benchmarks, which are better suited to evaluating individual subcomponents (e.g., entity linking, relation extraction).

6 papers | 0 benchmarks | Texts

WikiCREM

An unsupervised dataset for coreference resolution, introduced in Kocijan et al., "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution", presented at EMNLP 2019.

6 papers | 0 benchmarks | Texts

BiPaR

BiPaR is a manually annotated bilingual parallel novel-style machine reading comprehension (MRC) dataset, developed to support monolingual, multilingual and cross-lingual reading comprehension on novels. The biggest difference between BiPaR and existing reading comprehension datasets is that each triple (Passage, Question, Answer) in BiPaR is written in parallel in two languages. BiPaR is diverse in prefixes of questions, answer types and relationships between questions and passages. Answering the questions requires reading comprehension skills of coreference resolution, multi-sentence reasoning, and understanding of implicit causality.

6 papers | 0 benchmarks | Texts
