Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

satnet-sudoku (SATNet's Sudoku training test)

A set of easy Sudoku instances used in the SATNet paper to train SATNet to solve Sudoku.

2 papers · 0 benchmarks · Texts

rrn-sudoku (RRN sudoku instances dataset)

A set of 180,000 Sudoku grids with hint counts ranging from the minimum of 17 (extremely hard instances) to 34 (easy instances), with 10,000 instances per level of hardness.

2 papers · 0 benchmarks · Texts

many-solutions-sudoku (Dataset of Sudoku grids with more than one solution)

A data set of Sudoku grids with more than one solution.

2 papers · 0 benchmarks · Texts

DEplain-APA-sent

DEplain-APA-sent: a German parallel corpus for sentence simplification on news texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”; in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 papers · 4 benchmarks · Texts

DEplain-web-sent

DEplain-web-sent: a German parallel corpus for sentence simplification on web texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German (“plain DE”; in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.

2 papers · 4 benchmarks · Texts

FinBench

FinBench is a benchmark for evaluating the performance of machine learning models with both tabular data inputs and profile text inputs.

2 papers · 0 benchmarks · Tabular, Texts

Text2KGBench

Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

2 papers · 0 benchmarks · Texts
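The Text2KGBench task described above — extracting facts while complying with an ontology's relations and domain/range constraints — can be illustrated with a minimal sketch. The ontology, entity types, and triples below are hypothetical examples for illustration, not taken from the benchmark itself.

```python
# Hypothetical ontology: for each relation, the allowed subject
# concept (domain) and object concept (range).
ONTOLOGY = {
    "directedBy": {"domain": "Film", "range": "Person"},
    "releasedIn": {"domain": "Film", "range": "Year"},
}

# Hypothetical entity -> concept typing.
TYPES = {
    "Inception": "Film",
    "Christopher Nolan": "Person",
    "2010": "Year",
}

def complies(triple):
    """Check one (subject, relation, object) fact against the ontology."""
    s, r, o = triple
    constraint = ONTOLOGY.get(r)
    if constraint is None:  # relation not defined in the ontology
        return False
    return (TYPES.get(s) == constraint["domain"]
            and TYPES.get(o) == constraint["range"])

facts = [
    ("Inception", "directedBy", "Christopher Nolan"),  # valid
    ("Inception", "releasedIn", "Christopher Nolan"),  # range violation
]
results = [complies(f) for f in facts]
print(results)  # [True, False]
```

A real evaluation would additionally score faithfulness to the input sentences; this sketch covers only the ontology-compliance side of the task.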

PIPPA

PIPPA (Personal Interaction Pairs between People and AI) is a partially-synthetic dataset. The dataset comprises over 1 million utterances that are distributed across 26,000 conversation sessions and provides a rich resource for researchers and AI developers to explore and refine conversational AI systems in the context of role-play scenarios.

2 papers · 0 benchmarks · Texts

StoryBench (StoryBench: A Multifaceted Benchmark for Continuous Story Visualization)

StoryBench is a multi-task benchmark to reliably evaluate the ability of text-to-video models to generate stories from a sequence of captions and their duration. It includes three datasets (DiDeMo, Oops, UVO) and three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts.

2 papers · 0 benchmarks · Texts, Videos

FairPrism

FairPrism is a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods.

2 papers · 0 benchmarks · Texts

BiGe (Bielefeld Gesture Corpus)

The BiGe corpus comprises 54,360 shots of interest extracted from TED and TEDx talks. All shots are tracked with full 3D landmarks.

2 papers · 0 benchmarks · Audio, Point cloud, Texts

CORE (Company Relation Extraction)


2 papers · 0 benchmarks · Texts

AIDA/testc

AIDA/testc is a new challenging test set for entity linking systems containing 131 Reuters news articles published between December 5th and 7th, 2020. It links the named entity mentions in this test set to their corresponding Wikipedia pages, using the same linking procedure employed in the original AIDA CoNLL-YAGO dataset. AIDA/testc has 1,160 unique Wikipedia identifiers, spanning over 3,777 mentions and encompassing a total of 46,456 words.

2 papers · 1 benchmark · Texts

udhr-lid

Clean version of UDHR (Universal Declaration of Human Rights), at the long sentence level.

2 papers · 0 benchmarks · Texts

NLP Taxonomy Classification Data

The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from an NLP taxonomy. The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.

2 papers · 0 benchmarks · Texts

Jam-ALT (JamALT: A Formatting-Aware Lyrics Transcription Benchmark)

JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.

2 papers · 7 benchmarks · Audio, Music, Speech, Texts

LinkedPapersWithCode

An RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications.

2 papers · 0 benchmarks · Graphs, Texts
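Since LinkedPapersWithCode exposes a SPARQL endpoint for direct web queries, a client would typically send a query string as a form parameter. The sketch below only composes such a request body; the prefixes and the query itself are illustrative placeholders, not the actual LPWC vocabulary — consult the LPWC documentation for the real class and property names and the endpoint URL before sending anything.

```python
from urllib.parse import urlencode

# Hypothetical query: list ten resources with their labels. The class
# is left as a variable (?cls) because the real LPWC schema terms are
# not reproduced here.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label WHERE {
  ?resource a ?cls ;
            rdfs:label ?label .
} LIMIT 10
"""

# A SPARQL endpoint accepts the query as a 'query' form parameter;
# here we only build the urlencoded request body rather than POSTing it.
body = urlencode({"query": query, "format": "json"})
print(body[:60])
```

With the body built this way, a standard HTTP POST to the endpoint (e.g. via urllib.request) would return JSON-formatted bindings.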

CoVaxFrames

CoVaxFrames includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.

2 papers · 0 benchmarks · Texts

MMVax-Stance

MMVax-Stance includes 113 Vaccine Hesitancy Framings found on Twitter about the COVID-19 vaccines. Language experts annotated multimodal image-text tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each framing.

2 papers · 0 benchmarks · Images, Texts

CREPE (Compositional REPresentation Evaluation)

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that—across 7 architectures trained with 4 algorithms on massive datasets—they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 183K hard negatives.

2 papers · 4 benchmarks · Images, Texts
Page 97 of 158