Datasets

3,148 machine learning datasets

3,148 dataset results

SST-3 (Stanford Sentiment Treebank: 3-way)

SST-5 is the Stanford Sentiment Treebank 5-way classification dataset (positive, somewhat positive, neutral, somewhat negative, negative). To create SST-3 (positive, neutral, negative), the 'somewhat positive' class was merged and treated as 'positive'. Similarly, the 'somewhat negative' class was merged and treated as 'negative'.

10 papers1 benchmarksTexts

Fermi

Click to add a brief description of the dataset (Markdown and LaTeX enabled).

10 papers0 benchmarksTexts

SciTail

The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples with 10,101 examples with entails label and 16,925 examples with neutral label.

9 papers4 benchmarksTexts

MedNLI (Medical Natural Language Inference)

The MedNLI dataset consists of the sentence pairs developed by Physicians from the Past Medical History section of MIMIC-III clinical notes annotated for Definitely True, Maybe True and Definitely False. The dataset contains 11,232 training, 1,395 development and 1,422 test instances. This provides a natural language inference task (NLI) grounded in the medical history of patients.

9 papers4 benchmarksMedical, Texts

CAS-VSR-W1k (LRW-1000)

LRW-1000 has been renamed as CAS-VSR-W1k.* It is a naturally-distributed large-scale benchmark for word-level lipreading in the wild, including 1000 classes with about 718,018 video samples from more than 2000 individual speakers. There are more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word which is composed by one or several Chinese characters. This dataset aims to cover a natural variability over different speech modes and imaging conditions to incorporate challenges encountered in practical applications.

9 papers2 benchmarksAudio, Texts, Videos

FarsTail

Natural Language Inference (NLI), also called Textual Entailment, is an important task in NLP with the goal of determining the inference relationship between a premise p and a hypothesis h. It is a three-class problem, where each pair (p, h) is assigned to one of these classes: "ENTAILMENT" if the hypothesis can be inferred from the premise, "CONTRADICTION" if the hypothesis contradicts the premise, and "NEUTRAL" if none of the above holds. There are large datasets such as SNLI, MNLI, and SciTail for NLI in English, but there are few datasets for poor-data languages like Persian. Persian (Farsi) language is a pluricentric language spoken by around 110 million people in countries like Iran, Afghanistan, and Tajikistan. FarsTail is the first relatively large-scale Persian dataset for NLI task. A total of 10,367 samples are generated from a collection of 3,539 multiple-choice questions. The train, validation, and test portions include 7,266, 1,537, and 1,564 instances, respectively.

9 papers1 benchmarksTexts

How2R

Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos and present each clip to the turkers, who are asked to select a video segment containing a single, self-contained scene. After this segment selection step, another group of workers are asked to write descriptions for each displayed segment. Narrations are not provided to the workers to ensure that their written queries are based on visual content only. These final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and its associated queries are split into 80% train, 10% val and 10% test.

9 papers0 benchmarksTexts, Videos

Business Scene Dialogue

The Japanese-English business conversation corpus, namely Business Scene Dialogue corpus, was constructed in 3 steps:

9 papers0 benchmarksTexts

CQASUMM

CQASUMM is a dataset for CQA (Community Question Answering) summarization, constructed from the 4.4 million Yahoo! Answers L6 dataset. The dataset contains ~300k annotated samples.

9 papers0 benchmarksTexts

Acronym Identification

Is an acronym disambiguation (AD) dataset for scientific domain with 62,441 samples which is significantly larger than the previous scientific AD dataset.

9 papers0 benchmarksTexts

ART Dataset (Abductive Reasoning in narrative Text)

ART consists of over 20k commonsense narrative contexts and 200k explanations.

9 papers0 benchmarksTexts

BIMCV COVID-19

BIMCV-COVID19+ dataset is a large dataset with chest X-ray images CXR (CR, DX) and computed tomography (CT) imaging of COVID-19 patients along with their radiographic findings, pathologies, polymerase chain reaction (PCR), immunoglobulin G (IgG) and immunoglobulin M (IgM) diagnostic antibody tests and radiographic reports from Medical Imaging Databank in Valencian Region Medical Image Bank (BIMCV). The findings are mapped onto standard Unified Medical Language System (UMLS) terminology and they cover a wide spectrum of thoracic entities, contrasting with the much more reduced number of entities annotated in previous datasets. Images are stored in high resolution and entities are localized with anatomical labels in a Medical Imaging Data Structure (MIDS) format. In addition, 23 images were annotated by a team of expert radiologists to include semantic segmentation of radiographic findings. Moreover, extensive information is provided, including the patient’s demographic information, type

9 papers0 benchmarksImages, Texts

ComQA

ComQA is a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.

9 papers0 benchmarksTexts

GeoWebNews

GeoWebNews provides test/train examples and enable fine-grained Geotagging and Toponym Resolution (Geocoding). This dataset is also suitable for prototyping and evaluating machine learning NLP models.

9 papers0 benchmarksTexts

Global Voices

Global Voices is a multilingual dataset for evaluating cross-lingual summarization methods. It is extracted from social-network descriptions of Global Voices news articles to cheaply collect evaluation data for into-English and from-English summarization in 15 languages.

9 papers0 benchmarksTexts

HappyDB

HappyDB is a corpus of 100,000 crowdsourced happy moments.

9 papers0 benchmarksTexts

IndoSum

The IndoSum dataset is a benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries.

9 papers0 benchmarksTexts

KnowIT VQA

KnowIT VQA is a video dataset with 24,282 human-generated question-answer pairs about The Big Bang Theory. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which need of the experience obtained from the viewing of the series to be answered.

9 papers0 benchmarksTexts, Videos

RAVEN-FAIR

RAVEN-FAIR is a modified version of the RAVEN dataset.

9 papers0 benchmarksTexts

SelQA

SelQA is a dataset that consists of questions generated through crowdsourcing and sentence length answers that are drawn from the ten most prevalent topics in the English Wikipedia.

9 papers0 benchmarksTexts

PreviousPage 47 of 158Next