MINT is a multilingual intimacy analysis dataset covering 13,384 tweets in 10 languages: English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. The dataset was released alongside SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis.
This is a dataset of three English-language books that do not contain the letter "e": all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bök (excluding the single chapter that uses the letter "e").
A corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid.
The XiaChuFang Recipe Corpus contains recipes from 下厨房 (XiaChuFang), a popular Chinese recipe-sharing website. The full corpus contains 1,520,327 Chinese recipes, of which 1,242,206 belong to 30,060 dishes, giving each dish 41.3 recipes on average.
The Reddit Climate Change Dataset contains 620K Reddit posts and 4.6M comments: all mentions of the terms "climate" and "change" across the entire Reddit social network up to 2022-09-01. Both were procured with SocialGrep's export feature and released as part of the SocialGrep Reddit datasets. Posts are labeled with their subreddit, title, creation date, domain, selftext, and score; comments are labeled with their subreddit, body, creation date, sentiment (precomputed with a VADER pipeline), and score.
The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted) and in "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al., 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.
Are Large Pre-Trained Language Models Leaking Your Personal Information? We analyze whether pre-trained language models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses using either the context surrounding the email address or prompts containing the owner's name.
Financial Language Understanding Evaluation is an open-source, comprehensive suite of benchmarks for the financial domain. It contains benchmarks across five NLP tasks in the financial domain, as well as common benchmarks used in previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection, and question answering.
E2E Refined is a dataset for sentence classification. It consists of 40,560 examples for training, 4,489 for validation, and 4,555 for testing. It is a refined version of the well-known MR-to-text E2E dataset in which many deletion, insertion, and substitution errors have been fixed.
Jericho Environment Commonsense Comprehension (JECC) is a dataset for commonsense reasoning. It consists of 29 games in multiple domains from the Jericho environment (Hausknecht et al., 2019).
The NLI4Wills corpus can be used to train transformer and sentence-transformer models for evaluating the validity of legal will statements. The dataset consists of ID numbers, three types of inputs (legal will statements, laws, and conditions), and classifications (support, refute, or unrelated).
TempWikiBio is a new data-to-text generation dataset containing more than 4 million chronologically ordered revisions of biographical articles from English Wikipedia, each paired with a structured personal profile.
EventEA is an event-centric entity alignment dataset, harvested from EventKG, DBpedia and Wikidata.
NJH is a dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels covering different aspects of incivility and intolerance. It enables a more fine-grained, multi-label approach to predicting incivility and hateful or intolerant content.
CoRAL is a language- and culturally-aware Croatian abusive-language dataset covering the phenomena of implicitness and reliance on local and global context.
A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.
The #chinahate dataset contains a total of 2,172,333 tweets hashtagged #china, posted during the collection period. It is designed for the task of hate speech detection.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
The Ambiguous VQA dataset is a dataset of ambiguous questions about images, paired with their answers. It is used to train and evaluate question generation models in English.