3,148 machine learning datasets
OpenAssistant Conversations is a human-feedback dataset released under the Apache-2.0 license, containing between 100K and 1M examples.
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1,500 English sentences translated into six languages (Czech, German, Finnish, Romanian, Russian, Turkish) and an additional 1,500 sentences from each of the six languages translated into English. For Romanian, a third of the test set was released as a development set instead. For Turkish, an additional 500-sentence development set was released. The sentences were selected from dozens of news websites and translated by professional translators. The training data consists of parallel corpora for training translation models, monolingual corpora for training language models, and development sets for tuning. Some training corpora were identical to those of WMT 2015 (Europarl, United Nations, the French-English 10⁹ corpus, Common Crawl, Russian-English parallel data provided by Yandex, and Wikipedia Headlines provided by CMU), and some were updated (CzEng v1.6pre, News Commentary v11, monolingual news data).
MCScript is the official dataset of SemEval-2018 Task 11. It consists of a collection of text passages about daily life activities and a series of questions referring to each passage; each question is equipped with two answer choices. MCScript comprises 9,731, 1,411, and 2,797 questions in the training, development, and test sets, respectively.
TECHQA is a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size – 600 training, 310 dev, and 490 evaluation question/answer pairs – thus reflecting the cost of creating large labeled datasets with actual data. Consequently, TECHQA is meant to stimulate research in domain adaptation rather than being a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote—a technical document that addresses a specific technical issue.
ROPES is a QA dataset which tests a system's ability to apply knowledge from a passage of text to a new situation. A system is presented with a background passage containing one or more causal or qualitative relations, a novel situation that uses this background, and questions that require reasoning about the effects of the relationships in the background passage in the context of the situation.
KELM is a large-scale synthetic corpus that verbalizes the Wikidata knowledge graph as natural-language text.
Moral Stories is a crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations, and their respective consequences.
The Natural Stories dataset consists of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected parse trees and includes self-paced reading time data.
Situated Interactive MultiModal Conversations (SIMMC) is the task of taking multimodal actions grounded in co-evolving multimodal input content in addition to the dialog history. The resource contains two SIMMC datasets totaling ~13K human-human dialogs (~169K utterances), collected with a multimodal Wizard-of-Oz (WoZ) setup in two shopping domains: (a) furniture (grounded in a shared virtual environment) and (b) fashion (grounded in an evolving set of images).
GeoS is a dataset for automatic math problem solving. It consists of SAT plane geometry questions, where every question has a textual description in English accompanied by a diagram and multiple answer choices. Questions and answers are compiled from previous official SAT exams and practice exams offered by the College Board, and ground-truth logical forms are annotated for all questions in the dataset.
FlickrStyle10K was built on the Flickr30K image-caption dataset. The original FlickrStyle10K dataset has 10,000 pairs of images and stylized captions, covering humorous and romantic styles. However, only the 7,000 pairs from the official training set are now publicly accessible. The dataset can be downloaded from https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip
A benchmark dataset for Aspect Sentiment Triplet Extraction (ASTE), an updated version of ASTE-Data-V1.
The Wiki-ZSL (Wiki Zero-Shot Learning) dataset contains 113 relations and 94,383 instances from Wikipedia. The dataset is divided into three subsets: training set (98 relations), validation set (5 relations) and test set (10 relations).
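The defining property of a zero-shot split like Wiki-ZSL's is that relation *types*, not instances, are partitioned, so relations seen at test time never appear in training. A minimal sketch of such a split (the function name and placeholder relation labels are illustrative, not part of the Wiki-ZSL release):

```python
import random

def zero_shot_relation_split(relations, n_val=5, n_test=10, seed=0):
    """Partition relation types into disjoint train/val/test sets,
    mirroring Wiki-ZSL's 98/5/10 relation split. Instances of a
    held-out relation never appear in the training data."""
    rels = sorted(relations)          # deterministic base order
    rng = random.Random(seed)
    rng.shuffle(rels)
    test = set(rels[:n_test])
    val = set(rels[n_test:n_test + n_val])
    train = set(rels[n_test + n_val:])
    return train, val, test

# Example with 113 placeholder relation labels, as in Wiki-ZSL.
relations = [f"R{i}" for i in range(113)]
train, val, test = zero_shot_relation_split(relations)
print(len(train), len(val), len(test))  # 98 5 10
```

Because the three sets are disjoint at the relation level, any instance whose relation falls in `test` is guaranteed unseen during training, which is what makes the evaluation zero-shot.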
InterHuman is a multimodal dataset of diverse two-person interactions. It consists of about 107M frames with accurate skeletal motions and 16,756 natural language descriptions.
The NetHack Learning Environment (NLE) is a Reinforcement Learning environment based on NetHack 3.6.6. It is designed to provide a standard reinforcement learning interface to the game, and comes with tasks that function as a first step to evaluate agents on this new environment. NetHack is one of the oldest and arguably most impactful videogames in history, as well as being one of the hardest roguelikes currently being played by humans. It is procedurally generated, rich in entities and dynamics, and overall an extremely challenging environment for current state-of-the-art RL agents, while being much cheaper to run compared to other challenging testbeds. Through NLE, the authors wish to establish NetHack as one of the next challenges for research in decision making and machine learning.
RECCON is a dataset for the task of recognizing emotion cause in conversations.
LitBank is an annotated dataset of 100 works of English-language fiction to support tasks in natural language processing and the computational humanities, described in more detail in the following publications:
EURLEX57K is a publicly available legal large-scale multi-label text classification (LMTC) dataset containing 57k English EU legislative documents from the EUR-LEX portal, tagged with ~4.3k labels (concepts) from the European Vocabulary (EUROVOC).
HeadQA is a multiple-choice question answering testbed designed to encourage research on complex reasoning. The questions come from exams used to access specialized positions in the Spanish healthcare system, and are challenging even for highly specialized humans.