Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

SUBJ (Subjectivity dataset)

SUBJ comprises collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars"), and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

20 papers · 1 benchmark · Texts
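To make the task concrete, here is a minimal subjectivity-classification sketch in scikit-learn. The four training sentences are hypothetical stand-ins, not items from SUBJ.

```python
# Minimal subjectivity-classifier sketch (SUBJ-style task).
# The training sentences are invented examples, not drawn from the dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "A thrilling, deeply moving triumph of a film.",         # subjective
    "The plot is dull and the acting feels wooden.",         # subjective
    "The film was released in 2003 and runs 112 minutes.",   # objective
    "The story follows a detective investigating a theft.",  # objective
]
labels = [1, 1, 0, 0]  # 1 = subjective, 0 = objective

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["An unforgettable, beautifully shot masterpiece."]))
```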

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search, 2,000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes based on a chemical classification; 36 terminal classes were used to annotate the GENIA corpus.

20 papers · 2 benchmarks · Texts

George Washington

The George Washington dataset contains 20 pages of letters written by George Washington and his associates in 1755, and is therefore categorized as a historical collection. The images are annotated at the word level and contain approximately 5,000 words.

20 papers · 0 benchmarks · Images, Texts

RoboCup

RoboCup is an initiative in which research groups compete by enabling their robots to play football matches. Playing football requires solving several challenging tasks, such as vision, motion, and team coordination. Framing research efforts around football attracts public interest (and potential research funding) to robotics, which may otherwise be less engaging to non-experts.

20 papers · 0 benchmarks · Texts

CLOTH (CLOze test by TeacHers)

The Cloze Test by Teachers (CLOTH) benchmark is a collection of nearly 100,000 four-way multiple-choice cloze-style questions from middle- and high-school-level English language exams, where the answer fills a blank in a given text. Each question is labeled with the type of deep reasoning it involves; the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning, i.e., reasoning over multiple sentences.

20 papers · 0 benchmarks · Texts
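For illustration, the sketch below represents one CLOTH-style question and fills each blank with its gold option. The field names (article, options, answers) follow the commonly described JSON layout of the release, but treat them as assumptions and check the official distribution for the exact schema.

```python
# Sketch of a CLOTH-style cloze question. Field names are assumed, not official.
passage = {
    "article": "She _ to school every day because the bus is too slow.",
    "options": [["walk", "walks", "walking", "walked"]],  # one 4-way choice per blank
    "answers": ["B"],  # gold option letters, one per blank
}

# Fill each blank with its gold option, in reading order.
letters = "ABCD"
text = passage["article"]
for opts, ans in zip(passage["options"], passage["answers"]):
    text = text.replace("_", opts[letters.index(ans)], 1)
print(text)  # "She walks to school every day because the bus is too slow."
```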

PubMed RCT (PubMed 200k RCT)

PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, most datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small; the authors hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as medicine.

20 papers · 0 benchmarks · Texts
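Below is a minimal parser sketch for the plain-text layout commonly used to distribute the dataset, where each abstract starts with a ###<PMID> line followed by one LABEL<TAB>sentence line per sentence. That layout is an assumption here; verify it against the actual release.

```python
# Parse a PubMed-RCT-style file into {pmid: [(label, sentence), ...]}.
# Assumed layout: "###<PMID>" header lines, then "LABEL\tsentence" lines,
# with blank lines between abstracts.
from typing import Dict, List, Tuple

def parse_rct(path: str) -> Dict[str, List[Tuple[str, str]]]:
    abstracts: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                      # blank separator between abstracts
            if line.startswith("###"):
                current = line[3:]            # abstract ID (PMID)
                abstracts[current] = []
            elif current is not None:
                label, _, sentence = line.partition("\t")
                abstracts[current].append((label, sentence))
    return abstracts
```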

QuaRel

QuaRel is a crowdsourced dataset of 2,771 multiple-choice story questions, including their logical forms.

20 papers · 0 benchmarks · Texts

ETHOS (multi-labEl haTe speecH detectiOn dataSet)

ETHOS is a hate speech detection dataset. It is built from YouTube and Reddit comments validated through a crowdsourcing platform. It has two subsets, one for binary classification and the other for multi-label classification. The former contains 998 comments, while the latter contains fine-grained hate-speech annotations for 433 comments.

20 papers · 0 benchmarks · Texts
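The multi-label subset can be modeled with one binary classifier per fine-grained category. The sketch below shows that setup with scikit-learn's OneVsRestClassifier; the comments and label columns are hypothetical, not taken from ETHOS.

```python
# Multi-label classification sketch in the spirit of ETHOS's fine-grained
# subset. Comments and label columns are invented placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

comments = ["example comment one", "example comment two", "example comment three"]
# One binary indicator per fine-grained category (columns are hypothetical).
labels = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]])

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),  # one classifier per label column
)
clf.fit(comments, labels)
print(clf.predict(["another example comment"]))
```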

MLQE-PE (Multilingual Quality Estimation and Automatic Post-editing Dataset)

The Multilingual Quality Estimation and Automatic Post-editing (MLQE-PE) dataset supports Machine Translation (MT) Quality Estimation (QE) and Automatic Post-Editing (APE). It covers seven language pairs, with human labels for 9,000 translations per language pair in the following formats: sentence-level direct assessments, sentence-level post-editing effort, and word-level good/bad labels. It also contains the post-edited sentences, the titles of the articles from which the sentences were extracted, and the neural MT models used to translate the text.

20 papers · 0 benchmarks · Texts
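One record can be pictured as the typed structure below. The field names are illustrative groupings of the annotation types listed above, not the official column names of the release.

```python
# Illustrative shape of one MLQE-PE record; field names are NOT official.
from dataclasses import dataclass
from typing import List

@dataclass
class QESample:
    source: str            # source-language sentence
    mt: str                # machine translation output
    post_edit: str         # human post-edited translation
    da_score: float        # sentence-level direct assessment
    pe_effort: float       # sentence-level post-editing effort (e.g., HTER)
    word_tags: List[str]   # one "OK"/"BAD" tag per MT token

sample = QESample(
    source="Das ist ein Test.",
    mt="This is an test.",
    post_edit="This is a test.",
    da_score=78.0,
    pe_effort=0.25,
    word_tags=["OK", "OK", "BAD", "OK", "OK"],
)
```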

PreCo

PreCo is a large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separate analysis of mention detection and mention clustering.

20 papers · 1 benchmark · Texts

RCTW-17 (Reading Chinese Text in the Wild)

RCTW-17 is a large-scale dataset featuring 12,263 annotated images. Two tasks are set up: text localization and end-to-end recognition. The associated competition ran from January 20 to May 31, 2017, and received 23 valid submissions from 19 teams.

20 papers · 0 benchmarks · Texts

ScisummNet

ScisummNet is a large-scale, manually annotated corpus of 1,000 scientific papers (in computational linguistics) for automatic summarization. Summaries for each paper are constructed from the papers that cite it and from its abstract. Source: ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks.

20 papers · 0 benchmarks · Texts

Paralex

The Paralex dataset is a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

20 papers · 4 benchmarks · Texts

PsyQA

PsyQA is a Chinese dataset for generating long counseling texts for mental health support.

20 papers · 0 benchmarks · Texts

MINTAKA

Mintaka is a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models. It is composed of 20,000 question-answer pairs collected in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish, for a total of 180,000 samples. Mintaka includes 8 types of complex questions, including superlative, intersection, and multi-hop questions, which were naturally elicited from crowd workers.

20 papers · 0 benchmarks · Texts

PhotoChat

PhotoChat is the first dataset that casts light on photo-sharing behavior in online messaging. It contains 12k dialogues, each paired with a user photo shared during the conversation. Based on this dataset, the authors propose two tasks to facilitate research on image-text modeling: a photo-sharing intent prediction task, which predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task, which retrieves the most relevant photo according to the dialogue context.

20 papers · 10 benchmarks · Images, Texts

FNC-1 (Fake News Challenge Stage 1)

FNC-1 was designed as a stance detection dataset and contains 75,385 labeled headline-article pairs. The pairs are labeled as agree, disagree, discuss, or unrelated. Each headline in the dataset is phrased as a statement.

19 papers · 6 benchmarks · Texts
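A common first step on FNC-1 is separating unrelated pairs from the rest. The sketch below is a toy relatedness check using TF-IDF cosine similarity; it is not the official baseline, and the threshold is arbitrary.

```python
# Toy headline/body relatedness check (not the official FNC-1 baseline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

headline = "City council approves new transit budget"
body = "The council voted on Tuesday to approve a budget for public transit."

# Fit a shared vocabulary, then compare the two texts in TF-IDF space.
vec = TfidfVectorizer().fit([headline, body])
sim = cosine_similarity(vec.transform([headline]), vec.transform([body]))[0, 0]
print("unrelated" if sim < 0.1 else "related (agree/disagree/discuss)")
```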

DIRHA (Distant-speech Interaction for Robust Home Applications)

DIRHA-English is a multi-microphone database composed of 1-minute real and simulated sequences. The overall corpus comprises several types of sequences, including: 1) phonetically rich sentences; 2) WSJ 5k utterances; 3) WSJ 20k utterances; 4) conversational speech (also including keywords and commands). The sequences are available for both UK and US English at 48 kHz. The DIRHA-English dataset offers the possibility to work with a very large number of microphone channels, to use microphone arrays with different characteristics, and to consider different speech recognition tasks (e.g., phone-loop, keyword spotting, and ASR with small and very large language models).

19 papers · 0 benchmarks · Audio, Texts

Senseval-2

There are now many computer programs for automatically determining the sense of a word in context (Word Sense Disambiguation or WSD). The purpose of SENSEVAL is to evaluate the strengths and weaknesses of such programs with respect to different words, different varieties of language, and different languages.

19 papers · 0 benchmarks · Texts

Mathematics Dataset

This dataset's code generates mathematical question-answer pairs covering a range of question types at roughly school-level difficulty. It is designed to test the mathematical learning and algebraic reasoning skills of learning models.

19 papers · 1 benchmark · Texts
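The output format is plain question/answer text pairs. The sketch below generates a toy analogue of that format; it is not the original generator, whose question types are far broader (see the official mathematics_dataset repository).

```python
# Toy analogue of the dataset's question/answer text format.
# NOT the original generator; this only mimics the output shape.
import random
from typing import Tuple

def arithmetic_pair(rng: random.Random) -> Tuple[str, str]:
    a, b = rng.randint(-50, 50), rng.randint(-50, 50)
    op = rng.choice(["+", "-", "*"])
    results = {"+": a + b, "-": a - b, "*": a * b}
    return f"What is {a} {op} {b}?", str(results[op])

rng = random.Random(0)  # fixed seed for reproducible samples
for _ in range(3):
    q, a = arithmetic_pair(rng)
    print(q, "->", a)
```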
Page 32 of 158