3,148 machine learning datasets
The Twitter News URL Corpus is the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and the first cross-domain benchmark for automatic paraphrase identification.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.
The WordNet Language Model Probing (WNLaMPro) dataset consists of relations between keywords and words. It contains 4 different kinds of relations: Antonym, Hypernym, Cohyponym and Corruption.
JParaCrawl is an English-Japanese parallel corpus; publicly available parallel corpora for this language pair are still limited. The corpus was constructed by broadly crawling the web and automatically aligning parallel sentences, amassing over 8.7 million sentence pairs.
French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.
The IBM-Rank-30k is a dataset for the task of argument quality ranking. It is a corpus of 30,497 arguments carefully annotated for point-wise quality.
A question type classification dataset with 6 classes for questions about a person, location, numeric information, etc. The test split has 500 questions, and the training split has 5452 questions.
Cant (also known as doublespeak, cryptolect, argot, anti-language or secret language) is important for understanding advertising, comedies and dog-whistle politics. DogWhistle is a large and diverse Chinese dataset for creating and understanding cant from a computational linguistics perspective.
IIIT-ILST is a dataset and benchmark for scene text recognition in three Indic scripts - Devanagari, Telugu and Malayalam. IIIT-ILST contains nearly 1,000 real images for each script, annotated with scene-text bounding boxes and transcriptions.
There are two versions of the NLmaps corpus. NLmaps (v1) and its extension NLmaps v2. Both versions of the NLmaps corpus consist of questions about geographical facts that can be answered with the OpenStreetMap (OSM) database (available under the Open Database Licence). The questions are in English and have a corresponding Machine Readable Language (MRL) parse. Gold answers can be obtained by executing the gold parses against the OSM database using the NLmaps backend, which is based on the Overpass-API (available under the Affero GPL v3).
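Since gold answers are obtained by executing parses through a backend built on the Overpass-API, the kind of query involved can be sketched as follows. This is an illustrative, hand-written Overpass QL query (the area name and tag filter are examples), not actual NLmaps MRL output or backend code.

```python
# A hand-written Overpass QL query of the kind the NLmaps backend might
# ultimately issue against the OSM database (illustrative example only):
# count the cafe nodes inside a named area.
query = """
[out:json];
area[name="Heidelberg"]->.a;
node(area.a)[amenity=cafe];
out count;
"""

# To execute against a public Overpass endpoint (requires network access):
# import urllib.request, urllib.parse
# req = urllib.request.Request(
#     "https://overpass-api.de/api/interpreter",
#     data=urllib.parse.urlencode({"data": query}).encode())
# print(urllib.request.urlopen(req).read().decode())

print("amenity=cafe" in query)  # True
```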
QA-SRL Bank 2.0 is a large-scale corpus of Question-Answer driven Semantic Role Labeling (QA-SRL) annotations. The corpus consists of over 250,000 question-answer pairs for over 64,000 sentences across 3 domains and was gathered with a new crowd-sourcing scheme that was shown to have high precision and good recall at modest cost.
XL-BEL is a benchmark for cross-lingual biomedical entity linking. The benchmark spans 10 typologically diverse languages.
OTTers is a dataset of human one-turn topic transitions. In this task, models must connect two topics in a cooperative and coherent manner, by generating a "bridging" utterance connecting the new topic to the topic of the previous conversation turn.
ConvoSumm is a suite of four datasets to evaluate a model’s performance on a broad spectrum of conversation data.
Swords (Stanford Word Substitution) is a benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in a context. Swords is composed of context, target word, and substitute triples (c, w, w'), each of which has a score that indicates the appropriateness of the substitute.
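The (c, w, w') triple structure described above can be sketched as a small data type. The class and field names here are illustrative, not the Swords release format, and the score range is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SubstituteJudgment:
    """One Swords-style entry: a context c, a target word w, a candidate
    substitute w', and an appropriateness score (names are illustrative)."""
    context: str
    target: str
    substitute: str
    score: float  # higher = more appropriate in this context (assumed scale)

# Example entry and a ranking of candidate substitutes by score.
candidates = [
    SubstituteJudgment("The firm raised prices.", "raised", "increased", 0.9),
    SubstituteJudgment("The firm raised prices.", "raised", "lifted", 0.4),
    SubstituteJudgment("The firm raised prices.", "raised", "reared", 0.1),
]
best = max(candidates, key=lambda j: j.score)
print(best.substitute)  # increased
```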
JerichoWorld is a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive narratives. Interactive narratives -- or text-adventure games -- are partially observable environments structured as long puzzles or quests in which an agent perceives and interacts with the world purely through textual natural language. Each individual game typically contains hundreds of locations, characters, and objects -- each with their own unique descriptions -- providing an opportunity to study the problem of giving language-based agents the structured memory necessary to operate in such worlds.
We present a reading comprehension challenge in which questions can only be answered by taking into account information from multiple sentences. The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills.
The Russian Commitment Bank is a corpus of naturally occurring discourses whose final sentence contains a clause-embedding predicate under an entailment cancelling operator (question, modal, negation, antecedent of conditional).
The ZS-F-VQA dataset is a new split of the F-VQA dataset for the zero-shot setting. First, the original train/test splits of the F-VQA dataset are combined, and the triples whose answers appear in the top-500 by occurrence frequency are selected. Next, this set of answers is randomly divided into a new training (seen) split $\mathcal{A}_s$ and a testing (unseen) split $\mathcal{A}_u$ at a ratio of 1:1. Following the F-VQA standard dataset, the division process is repeated 5 times. Each $(i,q,a)$ triple in the original F-VQA dataset is assigned to the training set if $a \in \mathcal{A}_s$, and to the testing set otherwise. The training and testing sets share $2565$ answer instances in F-VQA, compared to $0$ in ZS-F-VQA.
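The split-construction procedure above can be sketched on toy data. The triples, the top-4 cutoff (standing in for the paper's top-500), and the fixed seed are all illustrative assumptions; the sketch only demonstrates why the resulting train/test answer overlap is zero by construction.

```python
import random
from collections import Counter

# Toy stand-in for the combined F-VQA (image, question, answer) triples;
# the real dataset is far larger.
triples = [(f"img{i}", f"q{i}", f"a{i % 10}") for i in range(100)]

# Step 1: keep only triples whose answer is among the most frequent answers
# (the paper uses the top-500; we use top-4 for this toy example).
answer_freq = Counter(a for _, _, a in triples)
top_answers = [a for a, _ in answer_freq.most_common(4)]
filtered = [t for t in triples if t[2] in top_answers]

# Step 2: randomly divide the answer set 1:1 into seen (A_s) and unseen (A_u).
random.seed(0)
shuffled = random.sample(top_answers, len(top_answers))
half = len(shuffled) // 2
seen, unseen = set(shuffled[:half]), set(shuffled[half:])

# Step 3: a triple goes to training if its answer is seen, else to testing.
train = [t for t in filtered if t[2] in seen]
test = [t for t in filtered if t[2] in unseen]

# By construction, train and test share no answers (overlap = 0).
overlap = {t[2] for t in train} & {t[2] for t in test}
print(len(overlap))  # 0
```

In the real ZS-F-VQA this division is repeated 5 times to produce 5 splits; the sketch shows a single division.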
BMELD is a bilingual (English-Chinese) dialogue corpus for neural chat translation.