Datasets

3,148 machine learning datasets

EmpatheticDialogues

The EmpatheticDialogues dataset is a large-scale multi-turn empathetic dialogue dataset collected on Amazon Mechanical Turk, containing 24,850 one-to-one open-domain conversations. Each conversation was obtained by pairing two crowd-workers: a speaker and a listener. The speaker is asked to talk about a personal emotional experience. The listener infers the underlying emotion from what the speaker says and responds empathetically. The dataset provides 32 evenly distributed emotion labels.

64 papers | 7 benchmarks | Dialog, Texts

AQUA-RAT (Algebra Question Answering with Rationales)

Algebra Question Answering with Rationales (AQUA-RAT) is a dataset of about 100,000 algebraic word problems with natural language rationales. Each problem is a JSON object consisting of four parts:

* question - A natural language definition of the problem to solve
* options - 5 possible options (A, B, C, D and E), among which one is correct
* rationale - A natural language description of the solution to the problem
* correct - The correct option
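The four-part record described above could be read and checked roughly as follows. This is a minimal sketch, not the dataset's official loader: the sample record is invented for illustration, the helper names `parse_problem` and `correct_option_text` are hypothetical, and the `"A)1"` option string format is an assumption.

```python
import json

LETTERS = ["A", "B", "C", "D", "E"]
REQUIRED_KEYS = {"question", "options", "rationale", "correct"}

# Hypothetical record, invented for illustration; not taken from the dataset.
# The "A)1" style of option string is an assumption about the formatting.
record_line = json.dumps({
    "question": "If 3x + 2 = 11, what is the value of x?",
    "options": ["A)1", "B)2", "C)3", "D)4", "E)5"],
    "rationale": "Subtract 2 from both sides to get 3x = 9, so x = 3.",
    "correct": "C",
})


def parse_problem(line):
    """Parse one JSON record and check that it has the four described parts."""
    problem = json.loads(line)
    missing = REQUIRED_KEYS - problem.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(problem["options"]) != len(LETTERS):
        raise ValueError("expected exactly 5 options")
    if problem["correct"] not in LETTERS:
        raise ValueError("'correct' must be one of A, B, C, D, E")
    return problem


def correct_option_text(problem):
    """Return the option string that the 'correct' letter points to."""
    return problem["options"][LETTERS.index(problem["correct"])]


problem = parse_problem(record_line)
print(correct_option_text(problem))
```

Validating the field set up front keeps malformed records from silently reaching a training or evaluation loop.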

64 papers | 0 benchmarks | Texts

OSCAR

OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The dataset used for training multilingual models such as BART incorporates 138 GB of text.

64 papers | 0 benchmarks | Texts

AIDA CoNLL-YAGO

AIDA CoNLL-YAGO contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase mid.

64 papers | 0 benchmarks | Texts

XL-Sum

XL-Sum is a comprehensive and diverse dataset for abstractive summarization comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.

64 papers | 0 benchmarks | Texts

Sentence Compression

Sentence Compression is a dataset where the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence where supervised systems which require a structural alignment between the input and output can be successfully trained.

63 papers | 0 benchmarks | Texts

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.

63 papers | 7 benchmarks | Audio, Images, Texts

CSL-Daily

CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous sign language translation (SLT) dataset. It provides both spoken language translations and gloss-level annotations. Its topics revolve around people's daily lives (e.g., travel, shopping, medical care), the most likely SLT application scenario.

63 papers | 2 benchmarks | RGB Video, Texts, Videos

SLAKE

SLAKE is an English-Chinese bilingual dataset consisting of 642 images and 14,028 question-answer pairs for training and testing medical visual question answering (Med-VQA) systems.

63 papers | 0 benchmarks | Images, Medical, Texts

ComplexWebQuestions

ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language, and can be used in multiple ways:

62 papers | 4 benchmarks | Texts

DialogSum

DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

62 papers | 4 benchmarks | Texts

TinyStories

TinyStories is a dataset of short stories, synthetically generated by GPT-3.5 and GPT-4, that use only a small vocabulary.

62 papers | 0 benchmarks | Texts

FigureQA

FigureQA is a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts.

61 papers | 0 benchmarks | Images, Texts

WebQuestionsSP (WebQuestions Semantic Parses Dataset)

The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base Question Answering” [Yih, Richardson, Meek, Chang & Suh, 2016], in which we evaluated the value of gathering semantic parses, vs. answers, for a set of questions that originally comes from WebQuestions [Berant et al., 2013]. The WebQuestionsSP dataset contains full semantic parses in SPARQL queries for 4,737 questions, and “partial” annotations for the remaining 1,073 questions for which a valid parse could not be formulated or where the question itself is bad or needs a descriptive answer. This release also includes an evaluation script and the output of the STAGG semantic parsing system when trained using the full semantic parses. More detail can be found in the document and labeling instructions included in this release, as well as the paper.

61 papers | 4 benchmarks | Texts

CIRR (Compose Image Retrieval on Real-life images)

Composed Image Retrieval (or Image Retrieval conditioned on Language Feedback) is a relatively new retrieval task in which an input query consists of an image and a short textual description of how to modify the image.

61 papers | 12 benchmarks | Images, Texts

Game of 24

Game of 24 is a mathematical reasoning challenge, where the goal is to use 4 numbers and basic arithmetic operations (+-*/) to obtain 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”. We scrape data from 4nums.com, which has 1,362 games that are sorted from easy to hard by human solving time, and use a subset of relatively hard games indexed 901-1,000 for testing. For each task, we consider the output as success if it is a valid equation that equals 24 and uses the input numbers each exactly once. We report the success rate across 100 games as the metric.
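The success criterion described above (a valid equation that equals 24 and uses each input number exactly once) could be checked along these lines. This is a minimal sketch, not the authors' evaluation code: the function name `is_valid_solution` is hypothetical, and using exact `Fraction` arithmetic to avoid floating-point error in divisions is a design choice assumed here.

```python
import ast
import operator
import re
from fractions import Fraction

# Only the four basic arithmetic operations are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed expression exactly, rejecting anything but + - * /."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def is_valid_solution(puzzle, answer):
    """Hypothetical checker: True if `answer` solves the 4-number `puzzle`."""
    left, sep, right = answer.partition("=")
    if not sep or right.strip() != "24":
        return False
    # Each input number must be used exactly once.
    used = sorted(int(tok) for tok in re.findall(r"\d+", left))
    if used != sorted(int(tok) for tok in puzzle.split()):
        return False
    try:
        value = _evaluate(ast.parse(left.strip(), mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return value == 24


print(is_valid_solution("4 9 10 13", "(10 - 4) * (13 - 9) = 24"))
```

Parsing the expression with `ast` rather than calling `eval` keeps the checker restricted to integer literals and the four permitted operators.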

61 papers | 1 benchmark | Texts
