Datasets

3,148 machine learning datasets

3,148 dataset results

WikiGraphs

WikiGraphs is a dataset of Wikipedia articles each paired with a knowledge graph, to facilitate the research in conditional text generation, graph generation and graph representation learning. Existing graph-text paired datasets typically contain small graphs and short text (1 or few sentences), thus limiting the capabilities of the models that can be learned on the data.

4 papers15 benchmarksGraphs, Texts

QC-Science

QC-Science contains 47832 question-answer pairs belonging to the science domain tagged with labels of the form subject - chapter - topic. The dataset was collected with the help of a leading e-learning platform. The dataset consists of 40895 samples for training, 2153 samples for validation and 4784 samples for testing.

4 papers4 benchmarksTexts

COMPARE

COMPARE is a taxonomy and a dataset of comparison discussions in peer reviews of research papers in the domain of experimental deep learning.

4 papers0 benchmarksTexts

InferWiki

InferWiki is a Knowledge Graph Completion (KGC) dataset that improves upon existing benchmarks in inferential ability, assumptions, and patterns. First, each testing sample is predictable with supportive data in the training set. Second, InferWiki initiates the evaluation following the open-world assumption and improves the inferential difficulty of the closed-world assumption, by providing manually annotated negative and unknown triples. Third, the dataset includes various inference patterns (e.g., reasoning path length and types) for comprehensive evaluation.

4 papers0 benchmarksGraphs, Texts

AutoChart

AutoChart is a dataset for chart-to-text generation, a task that consists on generating analytical descriptions of visual plots.

4 papers0 benchmarksImages, Texts

Twitter Sentiment Analysis (Entity-Level Twitter Sentiment Analysis Dataset)

This is an entity-level Twitter Sentiment Analysis dataset. For each message, the task is to judge the sentiment of the entire sentence towards a given entity. For example, A outperforms B is positive for entity A but negative for entity B. The dataset contains ~70K labeled training messages and 1K labeled validation messages. It is available online for free on Kaggle.

4 papers2 benchmarksTexts

ConvRef

ConvRef is a conversational QA benchmark with reformulations. It consists of around 11k natural conversations with about 205k reformulations. ConvRef builds upon the conversational KG-QA benchmark ConvQuestions. Questions come from five different domains: books, movies, music, TV series and soccer and answers are Wikidata entities. We used conversation sessions in ConvQuestions as input to our user study. Study participants interacted with a baseline QA system, that was trained using the available paraphrases in ConvQuestions as proxies for reformulations. Users were shown follow-up questions in a given conversation interactively, one after the other, along with the answer coming from the baseline QA system. For wrong answers, the user was prompted to reformulate the question up to four times if needed. In this way, users were able to pose reformulations based on previous wrong answers and the conversation history.

4 papers0 benchmarksTexts

WikiNLDB

WikiNLDB is a novel dataset for training Natural Language Databases (NLDBs) which is generated by transforming structured data from Wikidata into natural language facts and queries.

4 papers0 benchmarksTexts

M5Product

The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

4 papers0 benchmarksAudio, Images, Tables, Texts, Videos

Bentham (Bentham project)

Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6 000 documents or > 25 000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset a part of the Bentham manuscripts.

4 papers2 benchmarksImages, Texts

ParaShoot

ParaShoot is the first question answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3000 annotated examples, similar to other question-answering datasets in low-resource languages.

4 papers0 benchmarksTexts

XTD10

XTD10 is a dataset for cross-lingual image retrieval and tagging consisting of the MSCOCO2014 caption test dataset annotated in 7 languages that were collected using a crowdsourcing platform.

4 papers0 benchmarksImages, Texts

Restaurant-ACOS

The Restaurant-ACOS dataset is constructed based on the SemEval 2016 Restaurant dataset (Pontiki et al., 2016) and its expansion datasets (Fan et al., 2019; Xu et al., 2020). The SemEval 2016 Restaurant dataset (Pontiki et al., 2016) was annotated with explicit and implicit aspects, categories, and sentiment. (Fan et al., 2019; Xu et al., 2020) further added the opinion annotations. We integrate their annotations to construct aspect-category-opinion-sentiment quadruples and further annotate the implicit opinions. The Restaurant-ACOS dataset contains 2286 sentences with 3658 quadruples. It is worth noting that the Restaurant-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment triple extraction, aspect-category-sentiment triple extraction, etc.

4 papers3 benchmarksTexts

Laptop-ACOS

Laptop-ACOS is a brand new Laptop dataset collected from the Amazon platform in the years 2017 and 2018 (covering ten types of laptops under six brands such as ASUS, Acer, Samsung, Lenovo, MBP, MSI, and so on). It contains 4,076 review sentences, much larger than the SemEval Laptop datasets. For Laptop-ACOS, we annotate the four elements and their corresponding quadruples all by ourselves. We employ the aspect categories defined in the SemEval 2016 Laptop dataset. The Laptop-ACOS dataset contains 4076 sentences with 5758 quadruples. As we have mentioned, a large percentage of the quadruples contain implicit aspects or implicit opinions . By comparing two datasets, it can be observed that Laptop-ACOS has a higher percentage of implicit opinions than Restaurant-ACOS . It is worth noting that the Laptop-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment tri

4 papers3 benchmarksTexts

IndoNLI

IndoNLI is the first human-elicited NLI dataset for Indonesian consisting of nearly 18K sentence pairs annotated by crowd workers and experts.

4 papers0 benchmarksTexts

RTASC (ROBIN Technical Acquisition Speech Corpus)

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian language. We provide text files, associated speech files (WAV, 44.1KHz, 16-bit, single channel), annotated text files in CoNLL-U format.

4 papers0 benchmarksSpeech, Tabular, Texts

RISeC (Recipe Instruction Semantics Corpus)

We propose a newly annotated dataset for information extraction on recipes. Unlike previous approaches to machine comprehension of procedural texts, we avoid a priori pre-defining domain-specific predicates to recognize (e.g., the primitive instructionsin MILK) and focus on basic understanding of the expressed semantics rather than directly reduce them to a simplified state representation.

4 papers0 benchmarksTexts

DaReCzech (Dataset for text relevance ranking in Czech)

DareCzech DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated query-documents pairs, which makes it one of the largest available datasets for this task.

4 papers2 benchmarksTexts

DISRPT2019 (DISRPT2019 shared task on Discourse Unit Segmentation and Connective Detection)

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote design of flexible methods for dealing with various guidelines, and help to push forward the discussion of standards for discourse units. For datasets which have treebanks, we will evaluate in two different scenarios: with and without gold syntax, or otherwise using provided automatic parses for comparison.

4 papers0 benchmarksSpeech, Texts

FACTIFY (a dataset on multi-modal fact verification)

FACTIFY is a dataset on multi-modal fact verification. It contains images, textual claim, reference textual documenta and image. The task is to classify the claims into support, not-enough-evidence and refute categories with the help of the supporting data. We aim to combat fake news in the social media era by providing this multi-modal dataset. Factify contains 50,000 claims accompanied with 100,000 images, split into training, validation and test sets.

4 papers0 benchmarksImages, Texts

PreviousPage 69 of 158Next