Datasets

3,148 machine learning datasets

SIMMC2.0

Next-generation task-oriented dialog systems need to understand conversational contexts together with their perceived surroundings in order to help users effectively in real-world multimodal environments. Existing task-oriented dialog datasets aimed at virtual assistance fall short because they do not situate the dialog in the user's multimodal context. To overcome this, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collected using a two-phase pipeline: (1) a novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions; (2) the generated utterances are manually paraphrased to collect diverse referring expressions. We provide an in-depth analysis of the collected dataset, and describe in detail the four main benchmark tasks we propose.

13 papers · 3 benchmarks · Images, Texts

GUE (Genome Understanding Evaluation)

A collection of 28 datasets across 7 tasks constructed for genome language model evaluation. The tasks are promoter prediction, core promoter prediction, splice site prediction, COVID variant classification, epigenetic marks prediction, and transcription factor binding site prediction on human and on mouse.

13 papers · 6 benchmarks · Medical, Texts

The COLOSSEUM (The COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation)

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.

13 papers · 1 benchmark · Images, Texts

MedConceptsQA

MedConceptsQA - Open Source Medical Concepts QA Benchmark

13 papers · 3 benchmarks · Texts

Amazon Baby (Amazon Baby 5-core)

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

13 papers · 4 benchmarks · Images, Texts

Ghostbuster

The Ghostbuster dataset uses the GPT-3.5-turbo model to generate texts in three domains (creative writing, news, and student essays), providing 2,000 texts in the first two domains and 1,994 in the third.

13 papers · 0 benchmarks · Texts

OUTFOX

OUTFOX contains 15K triplets, each consisting of an essay problem statement, a student-written essay, and an LLM-generated essay.

13 papers · 0 benchmarks · Texts

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. These 100 million bytes contain 205 unique tokens (distinct byte values).
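
The unique-token figure for a byte-level dataset can be checked by counting the distinct byte values in the first 100 million bytes. The following is a minimal sketch of that count, not the dataset authors' tooling; the function name `count_unique_bytes`, its parameters, and the local file path in the usage note are all assumptions made for illustration.

```python
import io
from typing import BinaryIO


def count_unique_bytes(
    stream: BinaryIO,
    limit: int = 100_000_000,   # number of leading bytes to inspect
    chunk_size: int = 1 << 20,  # read 1 MiB at a time to bound memory use
) -> int:
    """Count distinct byte values among the first `limit` bytes of `stream`."""
    seen = set()
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream reached before `limit`
            break
        seen.update(chunk)  # iterating over bytes yields ints in 0..255
        remaining -= len(chunk)
    return len(seen)


# Tiny in-memory demo: the bytes a, b, c are the only distinct values.
demo = count_unique_bytes(io.BytesIO(b"aabbc"))
print(demo)  # prints 3
```

With a local copy of the dump (the path is hypothetical), the same function could be applied as `count_unique_bytes(open("enwik8", "rb"))`; the description above reports 205 for that file.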

12 papers · 2 benchmarks · Texts

LANI

LANI is a 3D navigation environment and corpus in which an agent navigates between landmarks. It contains 27,965 crowd-sourced instructions for navigation in an open environment. Each datapoint includes an instruction, a human-annotated ground-truth demonstration trajectory, and an environment with various landmarks and lakes. The train/dev/test split is 19,758/4,135/4,072. Each environment specification defines the placement of 6–13 landmarks within a square grass field of size 50 m × 50 m.

12 papers · 0 benchmarks · Environment, Texts

DSTC7 Task 1 (Dialog System Technology Challenges Task 1)

The DSTC7 Task 1 dataset is a dataset and task for goal-oriented dialogue. The data consist of human-human conversations drawn from online resources, specifically the Ubuntu Internet Relay Chat (IRC) channel and an Advising dataset from the University of Michigan.

12 papers · 0 benchmarks · Texts

VizWiz-Captions

VizWiz-Captions consists of over 39,000 images originating from people who are blind, each paired with five captions.

12 papers · 0 benchmarks · Images, Texts

COVID-19 Fake News Dataset (COVID19 Fake News Detection in English)

Along with the COVID-19 pandemic, we are also fighting an "infodemic". Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm, and this is further exacerbated during a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.

12 papers · 1 benchmark · Texts

CDCP (Cornell eRulemaking Corpus)

The Cornell eRulemaking Corpus – CDCP is an argument mining corpus annotated with argumentative structure information capturing the evaluability of arguments. The corpus consists of 731 user comments on the Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB); the resulting dataset contains 4,931 elementary unit annotations and 1,221 support relation annotations. It is a resource for building argument mining systems that can not only extract arguments from unstructured text, but also identify what additional information is necessary for readers to understand and evaluate a given argument. Immediate applications include providing real-time feedback to commenters that specifies which types of support could be added to which propositions to construct better-formed arguments.

12 papers · 6 benchmarks · Texts

FM2 (FoolMeTwice)

FoolMeTwice (FM2 for short) is a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using "shortcuts" compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players "pay" to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks.

12 papers · 0 benchmarks · Texts

Overruling

The Overruling dataset is a law dataset for the task of determining whether a sentence overrules a prior decision. This is a binary classification task, where positive examples are overruling sentences and negative examples are non-overruling sentences extracted from legal opinions. In law, an overruling sentence is a statement that nullifies a previous case decision as a precedent, either by a constitutionally valid statute or by a decision of the same or a higher-ranking court that establishes a different rule on the point of law involved. The Overruling dataset consists of 2,400 sentences.

12 papers · 2 benchmarks · Texts

MM-COVID (Multilingual and Multidimensional COVID-19 Fake News Data Repository)

MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news together with the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in 6 languages: English, Spanish, Portuguese, Hindi, French, and Italian.

12 papers · 0 benchmarks · Texts

GLGE (General Language Generation Evaluation)

GLGE is a general language generation evaluation benchmark which is composed of 8 language generation tasks, including Abstractive Text Summarization (CNN/DailyMail, Gigaword, XSUM, MSNews), Answer-aware Question Generation (SQuAD 1.1, MSQG), Conversational Question Answering (CoQA), and Personalizing Dialogue (Personachat).

12 papers · 0 benchmarks · Texts

PROST (Physical Reasoning about Objects Through Space and Time)

The PROST (Physical Reasoning about Objects Through Space and Time) dataset contains 18,736 multiple-choice questions made from 14 manually curated templates, covering 10 physical reasoning concepts. All questions are designed to probe both causal and masked language models in a zero-shot setting.

12 papers · 0 benchmarks · Texts

FewCLUE

The Chinese Few-shot Learning Evaluation Benchmark (FewCLUE) is a comprehensive few-shot evaluation benchmark for Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification tasks to machine reading comprehension tasks.

12 papers · 0 benchmarks · Texts

CDR (BioCreative V CDR Task Corpus)

The BioCreative V CDR task corpus is manually annotated for chemicals, diseases, and chemical-induced disease (CID) relations. It contains the titles and abstracts of 1,500 PubMed articles and is split into equally sized train, validation, and test sets. It is common to first tune a model on the validation set and then train on the combination of the train and validation sets before evaluating on the test set. It is also common to filter out negative relations whose disease entity is a hypernym of the disease entity of a corresponding true relation within the same abstract (see Appendix C of this paper for details).
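
The hypernym filtering step can be sketched as follows. This is an illustration under stated assumptions, not the procedure from the referenced appendix: it assumes a disease hierarchy given as a child-to-parents map (for example a MeSH-style tree), it reads "corresponding" as "sharing the same chemical", and every identifier and function name below is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Relation = Tuple[str, str]  # (chemical_id, disease_id); IDs are made up here


def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all strict ancestors (hypernyms) of `node` in a parent map."""
    found: Set[str] = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in found:  # also guards against cycles in the map
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def filter_hypernym_negatives(
    positives: Iterable[Relation],
    negatives: Iterable[Relation],
    parents: Dict[str, Set[str]],
) -> List[Relation]:
    """Drop negatives whose disease is a hypernym of a true relation's disease.

    Assumption: a negative only "corresponds" to a true relation when both
    involve the same chemical. Call this once per abstract.
    """
    blocked: Set[Relation] = set()
    for chemical, disease in positives:
        for hypernym in ancestors(disease, parents):
            blocked.add((chemical, hypernym))
    return [relation for relation in negatives if relation not in blocked]


# Hypothetical hierarchy: cirrhosis -> liver disease -> disease.
parents = {
    "D_cirrhosis": {"D_liver_disease"},
    "D_liver_disease": {"D_disease"},
}
positives = [("C1", "D_cirrhosis")]
negatives = [
    ("C1", "D_liver_disease"),  # hypernym of a true disease, same chemical
    ("C1", "D_kidney_disease"),  # unrelated disease, kept
    ("C2", "D_liver_disease"),  # different chemical, kept
    ("C1", "D_disease"),         # transitive hypernym, dropped
]
kept = filter_hypernym_negatives(positives, negatives, parents)
```

In this toy example only the two negatives that are unrelated to the true relation survive, which avoids penalizing a model for predicting a more general form of a correct disease.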

12 papers · 3 benchmarks · Texts
