Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

TopiOCQA

TopiOCQA (pronounced Tapioca) is an open-domain conversational dataset with topic switches on Wikipedia. TopiOCQA contains 3,920 conversations with information-seeking questions and free-form answers. On average, a conversation in the dataset spans 13 question-answer turns and involves four topics (documents). TopiOCQA poses a challenging test-bed for models, where efficient retrieval is required on multiple turns of the same conversation, in conjunction with constructing valid responses using conversational history.

22 papers · 0 benchmarks · Texts
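The per-turn retrieval setup described above can be sketched as a record. The field names below are invented for illustration; the dataset's actual schema may differ.

```python
# Hypothetical shape of one TopiOCQA conversation (field names are
# illustrative, not the dataset's real schema).
conversation = {
    "conv_id": "example-1",
    "turns": [
        {"question": "Who wrote the novel Dune?",
         "answer": "Frank Herbert",
         "topic": "Dune (novel)"},  # Wikipedia document grounding this turn
        {"question": "When was it published?",
         "answer": "1965",
         "topic": "Dune (novel)"},
        # ... on average 13 question-answer turns per conversation,
        # spanning about four topics (documents)
    ],
}

def history_before(conv, turn_index):
    """Return the (question, answer) pairs preceding the given turn:
    the conversational history that retrieval must condition on."""
    return [(t["question"], t["answer"]) for t in conv["turns"][:turn_index]]

print(history_before(conversation, 1))
# [('Who wrote the novel Dune?', 'Frank Herbert')]
```

The point of the sketch: unlike single-turn QA, each turn's retrieval query depends on the accumulated history, and the grounding document can change mid-conversation.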

iSarcasmEval

iSarcasmEval was the first shared task to target intended sarcasm detection: the data for the task was provided and labelled by the authors of the texts themselves. This approach minimises the drawbacks of other methods of collecting sarcasm data, which rely on distant supervision or third-party annotations. The shared task covers two languages, English and Arabic, and three subtasks: sarcasm detection, sarcasm category classification, and pairwise sarcasm identification given a sarcastic sentence and its non-sarcastic rephrase. The task received submissions from 60 different teams, with the sarcasm detection subtask being the most popular; most participating teams utilised pre-trained language models.

22 papers · 0 benchmarks · Texts

XLCoST (Cross-Lingual Code Snippet)

XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.

22 papers · 0 benchmarks · Texts

WebSRC (WebSRC: A Dataset for Web-Based Structural Reading Comprehension)

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.

22 papers · 2 benchmarks · Images, Tables, Texts
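Since each WebSRC answer is either a text span on the page or yes/no, an example record might look like the following. The field names are assumptions for illustration, not WebSRC's actual schema.

```python
# Two invented WebSRC-style examples: one span answer, one yes/no answer.
examples = [
    {"question": "What is the listed price?",
     "html": '<div class="price">$999</div>',
     "answer": {"type": "span", "text": "$999"}},
    {"question": "Is free shipping offered?",
     "html": "<span>Free shipping on all orders</span>",
     "answer": {"type": "yes_no", "text": "yes"}},
]

# A span-type answer must literally occur in the page source;
# a yes/no answer need not.
for ex in examples:
    if ex["answer"]["type"] == "span":
        assert ex["answer"]["text"] in ex["html"]

print([ex["answer"]["type"] for ex in examples])  # ['span', 'yes_no']
```

The check above captures why the task demands structural understanding: extracting the span requires locating it within the HTML, not just the rendered text.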

Do-Not-Answer

Do-Not-Answer is a dataset for evaluating safeguards in large language models and for deploying safer open-source LLMs at low cost. The dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions.

22 papers · 0 benchmarks · Texts

TimeQA (Time-Sensitive QA)

This dataset aims to study existing reading comprehension models' capability to perform temporal reasoning, and to test whether they are sensitive to the temporal descriptions in the given questions.

22 papers · 0 benchmarks · Texts

Motion-X

Motion-X is a large-scale 3D expressive whole-body motion dataset comprising 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from a wide variety of scenes, along with corresponding semantic labels and pose descriptions.

22 papers · 20 benchmarks · 3D, Texts

TVBench

TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks; (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input; and (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative.

22 papers · 1 benchmark · Texts, Videos

LIBERO-90

100 tasks from the LIBERO-100 suite. Note that the data are split under the folder names LIBERO-90 and LIBERO-10.

22 papers · 0 benchmarks · Actions, Images, Texts

CliCR

CliCR is a dataset for domain-specific reading comprehension, consisting of around 100,000 cloze queries constructed from clinical case reports.

21 papers · 1 benchmark · Medical, Texts

Event2Mind

Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.

21 papers · 0 benchmarks · Texts

FQuAD (French Question Answering Dataset)

FQuAD is a French-native reading comprehension dataset of questions and answers on a set of Wikipedia articles, consisting of 25,000+ samples in version 1.0 and 60,000+ samples in version 1.1.

21 papers · 2 benchmarks · Texts

MIR-1K

MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation.

21 papers · 0 benchmarks · Audio, Texts

WIQA (What-If Question Answering)

The WIQA dataset V1 has 39,705 questions, each containing a perturbation and a possible effect in the context of a paragraph. The dataset is split into 29,808 train questions, 6,894 dev questions and 3,003 test questions.

21 papers · 0 benchmarks · Texts
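The WIQA split sizes quoted above can be sanity-checked: the train, dev, and test counts sum exactly to the stated total.

```python
# WIQA V1 split sizes as given in the description.
splits = {"train": 29808, "dev": 6894, "test": 3003}

total = sum(splits.values())
print(total)  # 39705, matching the stated number of questions
```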

MultiFC

MultiFC is a publicly available dataset of naturally occurring factual claims for automatic claim verification. It is collected from 26 fact-checking websites in English, paired with textual sources and rich metadata, and labelled for veracity by human expert journalists.

21 papers · 0 benchmarks · Texts

WikiSplit

WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety-times-larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task.

21 papers · 0 benchmarks · Texts

Terms of Service

The Terms of Service dataset is a law dataset for the task of identifying whether contractual terms are potentially unfair. This is a binary classification task, where positive examples are potentially unfair contractual terms (clauses) from the terms of service in consumer contracts. Article 3 of Directive 93/13 on Unfair Terms in Consumer Contracts defines an unfair contractual term as follows: a contractual term is unfair if (1) it has not been individually negotiated; and (2) contrary to the requirement of good faith, it causes a significant imbalance in the parties' rights and obligations, to the detriment of the consumer. The Terms of Service dataset consists of 9,414 examples.

21 papers · 2 benchmarks · Texts
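The binary labelling scheme described above can be illustrated with invented clauses; the sentences below are made up for the sketch and are not drawn from the actual dataset.

```python
# Invented examples of the binary unfairness task: label 1 marks a
# potentially unfair clause, label 0 a clause that is not flagged.
clauses = [
    ("The provider may terminate the account at any time without notice.", 1),
    ("Users may cancel their subscription at the end of a billing period.", 0),
    ("The provider may change these terms at its sole discretion.", 1),
]

unfair = [text for text, label in clauses if label == 1]
print(len(unfair), "of", len(clauses), "clauses flagged as potentially unfair")
```

A classifier for this task consumes the clause text and predicts the binary label; the dataset's 9,414 examples provide the supervision.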

KLUE (Korean Language Understanding Evaluation)

Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.

21 papers · 0 benchmarks · Texts

COVID-Fact

COVID-Fact is a FEVER-like dataset of claims concerning the COVID-19 pandemic. The dataset contains claims, evidence for the claims, and contradictory claims refuted by the evidence.

21 papers · 0 benchmarks · Texts

STREUSLE

STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. The text is from the web reviews portion of the English Web Treebank [9]. STREUSLE incorporates comprehensive annotations of multiword expressions (MWEs) [1] and semantic supersenses for lexical expressions. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2], and prepositional/possessive expressions, as described in [3, 4, 5, 6, 7, 8]. Each lexical expression also carries a lexical category label indicating its holistic grammatical status; for verbal multiword expressions, these labels incorporate categories from the PARSEME 1.1 guidelines [15]. For each token, these pieces of information are concatenated into a lextag: a sentence's words and their lextags are sufficient to recover lexical categories, supersenses, and multiword expressions [8].

21 papers · 4 benchmarks · Texts
Page 31 of 158