Datasets

3,148 machine learning datasets

TIAGE

TIAGE is a topic-shift-aware dialog benchmark built from human annotations of topic shifts. TIAGE supports three tasks that investigate different scenarios of topic-shift modeling in dialog settings: topic-shift detection, topic-shift-triggered response generation, and topic-aware dialog generation.

7 papers | 0 benchmarks | Texts

GINC (Generative IN-Context learning Dataset)

GINC (Generative In-Context learning Dataset) is a small-scale synthetic dataset for studying in-context learning. The pretraining data is generated by a mixture of HMMs and the in-context learning prompt examples are also generated from HMMs (either from the mixture or not). The prompt examples are out-of-distribution with respect to the pretraining data since every example is independent, concatenated, and separated by delimiters. The GitHub repository provides code to generate GINC-style datasets of varying vocabulary sizes, number of HMMs, and other parameters.
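The generation scheme described above can be illustrated with a minimal sketch. This is not the code from the GINC repository: the function names, the uniform choice over HMMs, the random probability tables, and the delimiter token id are all assumptions made for illustration. It samples one pretraining document from a mixture of HMMs and builds a prompt from independently sampled examples joined by a delimiter.

```python
import random

DELIMITER = -1  # hypothetical delimiter token id, chosen outside the vocabulary


def make_hmm(num_states, vocab_size, rng):
    """Build a random HMM as start, transition, and emission probability tables."""
    def random_dist(n):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        return [w / total for w in weights]

    return {
        "start": random_dist(num_states),
        "trans": [random_dist(num_states) for _ in range(num_states)],
        "emit": [random_dist(vocab_size) for _ in range(num_states)],
    }


def sample_sequence(hmm, length, rng):
    """Sample `length` token ids from one HMM, starting from its start distribution."""
    states = range(len(hmm["start"]))
    vocab = range(len(hmm["emit"][0]))
    state = rng.choices(states, weights=hmm["start"])[0]
    tokens = []
    for _ in range(length):
        tokens.append(rng.choices(vocab, weights=hmm["emit"][state])[0])
        state = rng.choices(states, weights=hmm["trans"][state])[0]
    return tokens


def sample_pretraining_doc(hmms, length, rng):
    """Pick one HMM from the mixture (uniformly, by assumption) and sample a document."""
    return sample_sequence(rng.choice(hmms), length, rng)


def build_prompt(hmm, num_examples, example_length, rng):
    """Concatenate independently sampled examples, each followed by a delimiter."""
    prompt = []
    for _ in range(num_examples):
        prompt.extend(sample_sequence(hmm, example_length, rng))
        prompt.append(DELIMITER)
    return prompt


rng = random.Random(0)
hmms = [make_hmm(num_states=4, vocab_size=10, rng=rng) for _ in range(3)]
doc = sample_pretraining_doc(hmms, length=20, rng=rng)
prompt = build_prompt(hmms[0], num_examples=3, example_length=5, rng=rng)
```

Because every prompt example restarts from the HMM's start distribution, the concatenated prompt does not look like a single continuous pretraining document, which is the distribution mismatch the description refers to.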

7 papers | 0 benchmarks | Texts

PhoMT

PhoMT is a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs for machine translation.

7 papers | 1 benchmark | Texts

SLAKE-English

English subset of the SLAKE dataset, comprising 642 images and more than 7,000 question–answer pairs.

7 papers | 0 benchmarks | Images, Medical, Texts

Data Science Problems

Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown, along with unit tests to verify correctness and a Docker environment for reproducible execution. About one third of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks but also evaluate the solutions on real data. See our paper "Training and Evaluating a Jupyter Notebook Data Science Assistant" for more details about state-of-the-art results and other properties of the dataset.

7 papers | 0 benchmarks | Texts

USR-TopicalChat

This dataset was collected to assess dialog evaluation metrics. In the paper "USR: An Unsupervised and Reference Free Evaluation Metric for Dialog" (Mehri and Eskenazi, 2020), the authors collect this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as their newly proposed USR metric.

7 papers | 4 benchmarks | Texts

USR-PersonaChat

7 papers | 4 benchmarks | Texts

RefSeer

A dataset containing citations, citation contexts, and papers.

7 papers | 0 benchmarks | Texts

PET (PET: A new Dataset for Process Extraction from Natural Language Text)

The dataset contains 45 documents with narrative descriptions of business processes, together with their annotations. The documents are annotated with activities, gateways, actors, and flow information.

7 papers | 0 benchmarks | Texts

New3

New3 is a set of 527 instances from AMR 3.0 whose original source was the LORELEI DARPA project (not included in the AMR 2.0 training set). It consists of excerpts from newswire and online forums.

7 papers | 2 benchmarks | Graphs, Texts

PDNC (Project Dialogism Novel Corpus)

An annotated dataset of quotations and within-quotation mentions in 22 full-length English novels.

7 papers | 0 benchmarks | Texts

ACES (A Translation Accuracy Challenge Set)

ACES is a dataset covering 68 phenomena, ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. It can be used to evaluate a wide range of machine translation metrics.

7 papers | 1 benchmark | Texts

Demetr

Demetr is a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories.

7 papers | 0 benchmarks | Texts

HyperRED (Hyper-Relational Extraction Dataset)

HyperRED is a dataset for the new task of hyper-relational extraction, which extracts relation triplets together with qualifier information such as time, quantity or location. For example, the relation triplet (Leonard Parker, Educated At, Harvard University) can be factually enriched by including the qualifier (End Time, 1967). HyperRED contains 44k sentences with 62 relation types and 44 qualifier types.
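One way to picture a hyper-relational fact is as a relation triplet plus a list of qualifier pairs. The sketch below is only an illustration of that structure using the example from the description; the class and field names are hypothetical and do not reflect the actual HyperRED file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Qualifier:
    """A (label, value) pair that enriches a relation triplet (hypothetical schema)."""
    label: str
    value: str


@dataclass
class HyperRelationalFact:
    """A relation triplet with optional qualifiers (hypothetical schema)."""
    head: str
    relation: str
    tail: str
    qualifiers: List[Qualifier] = field(default_factory=list)

    def triplet(self) -> Tuple[str, str, str]:
        """Return the plain (head, relation, tail) triplet without qualifiers."""
        return (self.head, self.relation, self.tail)


# The example from the description: a triplet enriched with an End Time qualifier.
fact = HyperRelationalFact(
    head="Leonard Parker",
    relation="Educated At",
    tail="Harvard University",
    qualifiers=[Qualifier(label="End Time", value="1967")],
)
```

Dropping the `qualifiers` list recovers the ordinary triplet used in standard relation extraction, which is what `triplet()` returns.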

7 papers | 1 benchmark | Texts

CEFR-SP

CEFR-SP contains 17k English sentences annotated with Common European Framework of Reference for Languages (CEFR) levels assigned by English-education professionals.

7 papers | 0 benchmarks | Texts

LEVEN (Legal Event Detection Dataset)

LEVEN is the largest Legal Event Detection dataset as well as the largest Chinese Event Detection dataset.

7 papers | 0 benchmarks | Texts

HBW (Human Bodies in the Wild)

Human Bodies in the Wild (HBW) is a validation and test set for body shape estimation. It consists of images taken in the wild and ground-truth 3D body scans in SMPL-X topology. To create HBW, we collect body scans of 35 participants and register the SMPL-X model to the scans. Further, each participant is photographed in various outfits and poses in front of a white background and uploads full-body photos of themselves taken in the wild. The validation and test set images are released. The ground-truth shape is released only for the validation set.

7 papers | 0 benchmarks | 3D, Images, Texts

GHOSTS

GHOSTS is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. It is a collection of multiple datasets of prompts, totalling 728 prompts, for which ChatGPT's output was manually rated by experts.

7 papers | 0 benchmarks | Texts

TACRED-Revisited

The TACRED-Revisited dataset improves the crowd-sourced TACRED dataset for relation extraction by relabeling the dev and test sets using expert linguistic annotators. Relabeling focuses on the 5K most challenging instances in dev and test; in total, 51.2% of these are corrected. Published at ACL 2020.

7 papers | 1 benchmark | Texts

HiREST (HIerarchical REtrieval and STep-captioning)

The HiREST (HIerarchical REtrieval and STep-captioning) dataset is a benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. It consists of 3.4K text-video pairs from a video dataset, where 1.1K videos have annotations of moment spans relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (totaling 8.6K step captions). The benchmark comprises four tasks: video retrieval, moment retrieval, and two novel tasks, moment segmentation and step captioning.

7 papers | 0 benchmarks | Texts, Videos
