Datasets

3,148 machine learning datasets

3,148 dataset results

CosmosQA

CosmosQA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people’s everyday narratives, asking questions concerning on the likely causes or effects of events that require reasoning beyond the exact text spans in the context.

102 papers0 benchmarksTexts

QASPER

QASPER is a dataset for question answering on scientific research papers. It consists of 5,049 questions over 1,585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers.

102 papers1 benchmarksTexts

ConvAI2 (Conversational Intelligence Challenge 2)

The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open domain conversation. The ConvAI2 dataset for training models is based on the PERSONA-CHAT dataset. The speaker pairs each have assigned profiles coming from a set of 1155 possible personas (at training time), each consisting of at least 5 profile sentences, setting aside 100 never seen before personas for validation. As the original PERSONA-CHAT test set was released, a new hidden test set consisted of 100 new personas and over 1,015 dialogs was created by crowdsourced workers.

100 papers6 benchmarksDialog, Texts

GenEval

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to

100 papers28 benchmarksImages, Texts

CLUE (Chinese Language Understanding Evaluation Benchmark)

CLUE is a Chinese Language Understanding Evaluation benchmark. It consists of different NLU datasets. It is a community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.

99 papers0 benchmarksTexts

TextCaps

Contains 145k captions for 28k images. The dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.

98 papers0 benchmarksImages, Texts

QuALITY (Question Answering with Long Input Texts, Yes!)

QuALITY (Question Answering with Long Input Texts, Yes!) is a multiple-choice question answering dataset for long document comprehension. The dataset consists of context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, the questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts.

98 papers1 benchmarksTexts

ListOps

The ListOps examples are comprised of summary operations on lists of single digit integers, written in prefix notation. The full sequence has a corresponding solution which is also a single-digit integer, thus making it a ten-way balanced classification problem. For example, [MAX 2 9 [MIN 4 7 ] 0 ] has the solution 9. Each operation has a corresponding closing square bracket that defines the list of numbers for the operation. In this example, MIN operates on {4, 7}, while MAX operates on {2, 9, 4, 0}.

97 papers0 benchmarksTexts

FIGER (Fine-Grained Entity Recognition)

The FIGER dataset is an entity recognition dataset where entities are labelled using fine-grained system 112 tags, such as person/doctor, art/written_work and building/hotel. The tags are derivied from Freebase types. The training set consists of Wikipedia articles automatically annotated with distant supervision approach that utilizes the information encoded in anchor links. The test set was annotated manually.

96 papers5 benchmarksTexts

RAVEN

RAVEN consists of 1,120,000 images and 70,000 RPM (Raven's Progressive Matrices) problems, equally distributed in 7 distinct figure configurations.

96 papers0 benchmarksImages, Texts

Math23K (Math23K for Math Word Problem Solving)

Math23K is a dataset created for math word problem solving, contains 23, 162 Chinese problems crawled from the Internet. Refer to our paper for more details: The dataset is originally introduced in the paper Deep Neural Solver for Math Word Problems. The original files are originally split into train/test split, while other research efforts (https://github.com/2003pro/Graph2Tree) perform the train/dev/test split.

95 papers12 benchmarksTexts

CUHK-PEDES

The CUHK-PEDES dataset is a caption-annotated pedestrian dataset. It contains 40,206 images over 13,003 persons. Images are collected from five existing person re-identification datasets, CUHK03, Market-1501, SSM, VIPER, and CUHK01 while each image is annotated with 2 text descriptions by crowd-sourcing workers. Sentences incorporate rich details about person appearances, actions, poses.

93 papers15 benchmarksImages, Texts

TGIF-QA

The TGIF-QA dataset contains 165K QA pairs for the animated GIFs from the TGIF dataset [Li et al. CVPR 2016]. The question & answer pairs are collected via crowdsourcing with a carefully designed user interface to ensure quality. The dataset can be used to evaluate video-based Visual Question Answering techniques.

92 papers5 benchmarksTexts, Videos

CBT (Children’s Book Test)

Children’s Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg.

92 papers0 benchmarksTexts

JFLEG (JHU FLuency-Extended GUG corpus)

JFLEG is for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding.

91 papers2 benchmarksTexts

WikiMatrix

WikiMatrix is a dataset of parallel sentences in the textual content of Wikipedia for all possible language pairs. The mined data consists of:

91 papers0 benchmarksTexts

PopQA

PopQA is an open-domain QA dataset with 14k QA pairs with fine-grained Wikidata entity ID, Wikipedia page views, and relationship type information.

91 papers1 benchmarksTexts

E2E (End-to-End NLG Challenge)

End-to-End NLG Challenge (E2E) aims to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena.

90 papers11 benchmarksTexts

ST-VQA (Scene Text Visual Question Answering)

ST-VQA aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process.

90 papers0 benchmarksImages, Texts

Winoground

Winoground is a dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. Given two images and two captions, the goal is to match them correctly -- but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance.

90 papers3 benchmarksImages, Texts

PreviousPage 12 of 158Next