Datasets

19,997 machine learning datasets

19,997 dataset results

Hearthstone

This dataset contains card descriptions of the card game Hearthstone and the code that implements them. These are obtained from the open-source implementation Hearthbreaker (https://github.com/danielyule/hearthbreaker).

22 papers0 benchmarksTabular

iSarcasmEval

iSarcasmEval is the first shared task to target intended sarcasm detection: the data for this task was provided and labelled by the authors of the texts themselves. Such an approach minimises the downfalls of other methods to collect sarcasm data, which rely on distant supervision or third-party annotations. The shared task contains two languages, English and Arabic, and three subtasks: sarcasm detection, sarcasm category classification, and pairwise sarcasm identification given a sarcastic sentence and its non-sarcastic rephrase. The task received submissions from 60 different teams, with the sarcasm detection task being the most popular. Most of the participating teams utilised pre-trained language models. In this paper, we provide an overview of the task, data, and participating teams.

22 papers0 benchmarksTexts

ALG514

514 algebra word problems and associated equation systems gathered from Algebra.com.

22 papers4 benchmarks

XLCoST (Cross-Lingual Code Snippet)

XLCoST is a benchmark dataset for cross-lingual code intelligence. The dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-language code tasks.

22 papers0 benchmarksTexts

Satlas

Satlas is a remote sensing dataset and benchmark that is large in both breadth, featuring all of the aforementioned applications and more, as well as scale, comprising 290M labels under 137 categories and 7 label modalities.

22 papers0 benchmarksImages

VizWiz-Classification

Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform well on images from another domain. Complementing existing work in robustness testing, we introduce the first test dataset for this purpose which comes from an authentic use case where photographers wanted to learn about the content in their images. We built a new test set using 8,900 images taken by people who are blind for which we collected metadata to indicate the presence versus absence of 200 ImageNet object categories. We call this dataset VizWiz-Classification.

22 papers8 benchmarksImages

WebSRC (WebSRC: A Dataset for Web-Based Structural Reading Comprehension)

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.

22 papers2 benchmarksImages, Tables, Texts

100STYLE

Over 4 million frames of motion capture data for 100 different styles of locomotion. Can be used for animation, human motion and sequence modelling research.

22 papers0 benchmarks

Do-Not-Answer

Do-Not-Answer is a dataset to evaluate safeguards in large language models, and deploy safer open-source LLMs at a low cost. The dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We annotate and assess the responses of six popular LLMs to these instructions.

22 papers0 benchmarksTexts

ETTh1 (96) (ETT (Electricity Transformer Temperature))

The Electricity Transformer Temperature (ETT) is a crucial indicator in the electric power long-term deployment. This dataset consists of 2 years data from two separated counties in China. To explore the granularity on the Long sequence time-series forecasting (LSTF) problem, different subsets are created, {ETTh1, ETTh2} for 1-hour-level and ETTm1 for 15-minutes-level. Each data point consists of the target value ”oil temperature” and 6 power load features. The train/val/test is 12/4/4 months.

22 papers5 benchmarks

TimeQA (Time-Sensitive QA)

This dataset is aimed to study the existing reading comprehension models' capability to perform temporal reasoning, and see whether they are sensitive to the temporal description in the given question.

22 papers0 benchmarksTexts

GSCAN (Grounded SCAN)

Grounded SCAN poses a simple task, where an agent must execute action sequences based on a synthetic language instruction.

22 papers0 benchmarks

Motion-X

Motion-X is a large-scale 3D expressive whole-body motion dataset, which comprises 15.6M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 81.1K motion sequences from massive scenes, meanwhile providing corresponding semantic labels and pose descriptions.

22 papers20 benchmarks3D, Texts

TVBench

TVBench is a new benchmark specifically created to evaluate temporal understanding in video QA. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative.

22 papers1 benchmarksTexts, Videos

LIBERO-90

100 tasks from LIBERO-100 suite. Note that the datasets are split under the folder names of LIBERO-90 and LIBERO-10.

22 papers0 benchmarksActions, Images, Texts

WISE

WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal understanding, and natural science.

22 papers7 benchmarks

CliCR

CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.

21 papers1 benchmarksMedical, Texts

Event2Mind

Event2Mind is a corpus of 25,000 event phrases covering a diverse range of everyday events and situations.

21 papers0 benchmarksTexts

ContactDB

ContactDB is a dataset of contact maps for household objects that captures the rich hand-object contact that occurs during grasping, enabled by use of a thermal camera. ContactDB includes 3,750 3D meshes of 50 household objects textured with contact maps and 375K frames of synchronized RGB-D+thermal images.

21 papers1 benchmarks3D, Images

FQuAD (French Question Answering Dataset)

A French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version.

21 papers2 benchmarksTexts

PreviousPage 99 of 1000Next