MINT is a multilingual intimacy analysis dataset covering 13,384 tweets in 10 languages: English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. The dataset was released alongside SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis.
This is a dataset of three English-language books that do not contain the letter "e": all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bök (excluding the single chapter that uses the letter "e").
A corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid.
The XiaChuFang Recipe Corpus contains recipes from 下厨房 (XiaChuFang), a popular Chinese recipe-sharing website. The full corpus contains 1,520,327 Chinese recipes, of which 1,242,206 belong to 30,060 dishes, giving each dish 41.3 recipes on average.
The Reddit Climate Change Dataset contains 620K Reddit posts and 4.6M comments: all mentions of the terms "climate" and "change" across the entire Reddit social network up to 2022-09-01. Both were procured with SocialGrep's export feature and released as part of the SocialGrep Reddit datasets. Posts are labeled with their subreddit, title, creation date, domain, selftext, and score; comments are labeled with their subreddit, body, creation date, sentiment (precomputed with a VADER pipeline), and score.
The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted) and in "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al., 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.
Are Large Pre-Trained Language Models Leaking Your Personal Information? We analyze whether pre-trained language models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses using either the context surrounding the email address or prompts containing the owner's name.
Financial Language Understanding Evaluation is an open-source, comprehensive suite of benchmarks for the financial domain. It contains benchmarks across five NLP tasks in the financial domain, as well as common benchmarks used in previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection, and question answering.
E2E Refined is a dataset for sentence classification. It consists of 40,560 examples for training, 4,489 for validation, and 4,555 for testing. It is a refined version of the well-known MR-to-text E2E dataset in which many deletion, insertion, and substitution errors have been fixed.
Jericho Environment Commonsense Comprehension (JECC) is a dataset for commonsense reasoning. It consists of 29 games in multiple domains from the Jericho environment (Hausknecht et al., 2019).
The NLI4Wills corpus can be used to train transformer and sentence-transformer models for evaluating the validity of legal will statements. The dataset consists of ID numbers, three types of inputs (legal will statements, laws, and conditions), and classifications (support, refute, or unrelated).
TempWikiBio is a new data-to-text generation dataset containing more than 4 million chronologically ordered revisions of biographical articles from English Wikipedia, each paired with a structured personal profile.
EventEA is an event-centric entity alignment dataset, harvested from EventKG, DBpedia and Wikidata.
NJH is a dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels covering different aspects of incivility and intolerance. It enables a more fine-grained, multi-label approach to predicting incivility and hateful or intolerant content.
CoRAL is a language- and culturally-aware Croatian abusive-language dataset covering the phenomena of implicitness and reliance on local and global context.
A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.
The #chinahate dataset contains a total of 2,172,333 tweets hashtagged #china, posted during the collection period. It is designed for the task of hate speech detection.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
The Ambiguous VQA dataset is a dataset of ambiguous questions about images, paired with their answers. It is used to train and evaluate question generation models in English.