Datasets

3,148 machine learning datasets

3,148 dataset results

QuaRTz (QuaRTz Dataset)

QuaRTz is a crowdsourced dataset of 3864 multiple-choice questions about open domain qualitative relationships. Each question is paired with one of 405 different background sentences (sometimes short paragraphs).

30 papers0 benchmarksTexts

HELP

The HELP dataset is an automatically created natural language inference (NLI) dataset that embodies the combination of lexical and logical inferences focusing on monotonicity (i.e., phrase replacement-based reasoning). The HELP (Ver.1.0) has 36K inference pairs consisting of upward monotone, downward monotone, non-monotone, conjunction, and disjunction.

30 papers0 benchmarksTexts

Holl-E

Holl-E is a dataset containing movie chats wherein each response is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments and reviews about the movie.

30 papers0 benchmarksTexts

ECtHR (European Court of Human Rights Cases)

ECtHR is a dataset comprising European Court of Human Rights cases, including annotations for paragraph-level rationales. This dataset comprises 11k ECtHR cases and can be viewed as an enriched version of the ECtHR dataset of Chalkidis et al. (2019), which did not provide ground truth for alleged article violations (articles discussed) and rationales. It is released with silver rationales obtained from references in court decisions, and gold rationales provided by ECHR-experienced lawyers

30 papers0 benchmarksTexts

AGQA (Action Genome Question Answering)

Action Genome Question Answering (AGQA) is a benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question answer pairs for 9.6K videos. It also contains a balanced subset of 3.9M question answer pairs, 3 orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question structures.

30 papers0 benchmarksTexts, Videos

VALUE (Video-And-Language Understanding Evaluation)

VALUE is a Video-And-Language Understanding Evaluation benchmark to test models that are generalizable to diverse tasks, domains, and datasets. It is an assemblage of 11 VidL (video-and-language) datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.

30 papers0 benchmarksTexts, Videos

CREAK

A testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities with commonsense inferences.

30 papers0 benchmarksTexts

EntityQuestions

EntityQuestions is a dataset of simple, entity-rich questions based on facts from Wikidata (e.g., "Where was Arve Furset born? ").

30 papers1 benchmarksTexts

VideoInstruct (Video Instruction Dataset)

Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs. employs a combination of human-assisted and semi-automatic annotation techniques, aiming to produce high-quality video instruction data. These methods create question-answer pairs related to

30 papers31 benchmarksTexts, Videos

CODAH (COmmonsense Dataset Adversarially-authored by Humans)

The COmmonsense Dataset Adversarially-authored by Humans (CODAH) is an evaluation set for commonsense question-answering in the sentence completion style of SWAG. As opposed to other automatically generated NLI datasets, CODAH is adversarially constructed by humans who can view feedback from a pre-trained model and use this information to design challenging commonsense questions. It contains 2801 questions in total, and uses 5-fold cross validation for evaluation.

29 papers2 benchmarksTexts

WI-LOCNESS (Cambridge English Write & Improve & LOCNESS)

WI-LOCNESS is part of the Building Educational Applications 2019 Shared Task for Grammatical Error Correction. It consists of two datasets:

29 papers1 benchmarksTexts

Pushshift Reddit

Pushshift makes available all the submissions and comments posted on Reddit between June 2005 and April 2019. The dataset consists of 651,778,198 submissions and 5,601,331,385 comments posted on 2,888,885 subreddits.

29 papers0 benchmarksTexts

MaRVL (Multicultural Reasoning over Vision and Language)

Multicultural Reasoning over Vision and Language (MaRVL) is a dataset based on an ImageNet-style hierarchy representative of many languages and cultures (Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish). The selection of both concepts and images is entirely driven by native speakers. Afterwards, we elicit statements from native speakers about pairs of images. The task consists in discriminating whether each grounded statement is true or false.

29 papers2 benchmarksImages, Texts

MassiveText

MassiveText is a collection of large English-language text datasets from multiple sources: web pages, books, news articles, and code. The data pipeline includes text quality filtering, removal of repetitious text, deduplication of similar documents, and removal of documents with significant test-set overlap. MassiveText contains 2.35 billion documents or about 10.5 TB of text.

29 papers0 benchmarksTexts

Wizard-of-Oz

The WoZ 2.0 dataset is a newer dialogue state tracking dataset whose evaluation is detached from the noisy output of speech recognition systems. Similar to DSTC2, it covers the restaurant search domain and has identical evaluation.

28 papers2 benchmarksTexts

MR (MR Movie Reviews)

MR Movie Reviews is a dataset for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars") and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity.

28 papers6 benchmarksTexts

Resume NER

Resume contains eight fine-grained entity categories -score from 74.5% to 86.88%.

28 papers3 benchmarksTexts

EVALution

EVALution dataset is evenly distributed among the three classes (hypernyms, co-hyponyms and random) and involves three types of parts of speech (noun, verb, adjective). The full dataset contains a total of 4,263 distinct terms consisting of 2,380 nouns, 958 verbs and 972 adjectives.

28 papers0 benchmarksTexts

How2QA

To collect How2QA for video QA task, the same set of selected video clips are presented to another group of AMT workers for multichoice QA annotation. Each worker is assigned with one video segment and asked to write one question with four answer candidates (one correctand three distractors). Similarly, narrations are hidden from the workers to ensure the collected QA pairs are not biased by subtitles. Similar to TVQA, the start and end points are provided for the relevant moment for each question. After filtering low-quality annotations, the final dataset contains 44,007 QA pairs for 22k 60-second clips selected from 9035 videos.

28 papers2 benchmarksTexts, Videos

DuRecDial

A human-to-human Chinese dialog dataset (about 10k dialogs, 156k utterances), which contains multiple sequential dialogs for every pair of a recommendation seeker (user) and a recommender (bot).

28 papers0 benchmarksTexts

PreviousPage 26 of 158Next