Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Mint (Multilingual Intimacy Analysis)

Mint is a multilingual intimacy analysis dataset covering 13,384 tweets in 10 languages: English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. It was released along with SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis.

1 paper · 0 benchmarks · Texts

Lipogram-e

A dataset of three English books that avoid the letter "e": all of "Gadsby" by Ernest Vincent Wright, all of "A Void" by Georges Perec, and almost all of "Eunoia" by Christian Bok (excluding the single chapter that uses the letter "e").

1 paper · 2 benchmarks · Texts

Demosthenes

A corpus for argument mining in legal documents, composed of 40 decisions of the Court of Justice of the European Union on matters of fiscal state aid.

1 paper · 0 benchmarks · Texts

XiaChuFang Recipe Corpus

The XiaChuFang Recipe Corpus contains recipes from 下厨房 (XiaChuFang), a popular Chinese recipe-sharing website. The full corpus contains 1,520,327 Chinese recipes; 1,242,206 of them belong to 30,060 dishes, for an average of 41.3 recipes per dish.

1 paper · 0 benchmarks · Texts

The Reddit Climate Change Dataset

The Reddit Climate Change Dataset contains 620K Reddit posts and 4.6M comments: all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of the SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (computed with a VADER sentiment pipeline), and score.

1 paper · 0 benchmarks · Tabular, Texts
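As an illustration of the comment schema described above, the rows can be read with a standard CSV reader. This is a minimal sketch: the column names are assumed from the description, and the actual SocialGrep export headers may differ.

```python
import csv
import io

# Hypothetical comment rows following the fields described above;
# the real export may use different header names and extra columns.
raw = io.StringIO(
    "subreddit,body,created,sentiment,score\n"
    'r/science,"Climate change is accelerating",2021-06-01,-0.34,128\n'
    'r/worldnews,"New climate report released",2022-03-15,0.0,45\n'
)

comments = list(csv.DictReader(raw))

# Filter to highly upvoted comments; CSV fields arrive as strings.
high_score = [c for c in comments if int(c["score"]) > 100]
print(high_score[0]["subreddit"])  # r/science
```

Reading with `csv.DictReader` keeps each comment keyed by column name, so downstream code does not depend on column order.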

RGZ EMU: Semantic Taxonomy (Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy)

The data used in "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted) and in "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.

1 paper · 0 benchmarks · Images, Tabular, Texts

LM Email Address Leakage

A dataset from the study "Are Large Pre-Trained Language Models Leaking Your Personal Information?", which analyzes whether pre-trained language models (PLMs) are prone to leaking personal information. Specifically, the PLMs are queried for email addresses using contexts of the email address or prompts containing the owner's name.

1 paper · 0 benchmarks · Texts
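The two querying setups described above can be sketched as prompt construction. The templates and the name below are illustrative assumptions, not the paper's exact prompts.

```python
# Probe style 1: feed the model the text that preceded the address
# in the training data and let it continue (context probe).
def context_probe(preceding_text: str) -> str:
    return preceding_text

# Probe style 2: a prompt containing the owner's name (name probe).
# The template wording here is an assumption for illustration.
def name_probe(owner: str) -> str:
    return f"The email address of {owner} is"

prompt = name_probe("Jane Doe")
print(prompt)  # The email address of Jane Doe is
```

The returned string would then be passed to a language model's generation API; whether the model completes it with a real address is what the leakage analysis measures.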

Financial Language Understanding Evaluation

Financial Language Understanding Evaluation is an open-source, comprehensive suite of benchmarks for the financial domain. It contains benchmarks across five NLP tasks in the financial domain, as well as common benchmarks used in previous research. The tasks are financial sentiment analysis, news headline classification, named entity recognition, structure boundary detection, and question answering.

1 paper · 0 benchmarks · Texts

E2E Refined

E2E Refined is a dataset for sentence classification. It consists of 40,560 examples for training, 4,489 for validation, and 4,555 for test. It is a refined version of the well-known MR-to-text E2E dataset in which many deletion, insertion, and substitution errors have been fixed.

1 paper · 0 benchmarks · Texts

JECC (Jericho Environment Commonsense Comprehension)

Jericho Environment Commonsense Comprehension (JECC) is a dataset for commonsense reasoning. It consists of 29 games in multiple domains from the Jericho environment (Hausknecht et al., 2019).

1 paper · 0 benchmarks · Texts

NLI4Wills Corpus

The NLI4Wills Corpus can be used to train transformer and sentence-transformer models to evaluate the validity of legal will statements. The dataset consists of ID numbers, three types of inputs (legal will statements, laws, and conditions), and classifications (support, refute, or unrelated).

1 paper · 0 benchmarks · Texts

TempWikiBio

TempWikiBio is a data-to-text generation dataset containing more than 4 million chronologically ordered revisions of biographical articles from English Wikipedia, each paired with a structured personal profile.

1 paper · 0 benchmarks · Texts

EventEA

EventEA is an event-centric entity alignment dataset, harvested from EventKG, DBpedia and Wikidata.

1 paper · 0 benchmarks · Texts

NJH (Not Just Hate)

NJH is a dataset of over 40,000 tweets about immigration from the US and UK, annotated with six labels covering different aspects of incivility and intolerance. It supports a more fine-grained, multi-label approach to predicting incivility and hateful or intolerant content.

1 paper · 0 benchmarks · Texts

CoRAL dataset (CoRAL: a Context-aware Croatian Abusive Language Dataset)

CoRAL is a linguistically and culturally aware Croatian abusive language dataset covering implicitness and reliance on local and global context.

1 paper · 0 benchmarks · Texts

ICLR Database (with Textual Covariates)

A maintained database tracking ICLR submissions and reviews, augmented with author profiles and higher-level textual features.

1 paper · 0 benchmarks · Tabular, Texts

#chinahate

The #chinahate dataset contains a total of 2,172,333 tweets hashtagged #china, posted during the collection period. It is designed for the task of hate speech detection.

1 paper · 0 benchmarks · Texts

BWB

The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.

1 paper · 0 benchmarks · Texts

KGRED (Knowledge-Graph-Enhanced Relation Extraction Datasets)


1 paper · 0 benchmarks · Texts

Ambiguous VQA

The Ambiguous VQA dataset consists of ambiguous questions about images, together with their answers. It is used to train and evaluate question generation models in English.

1 paper · 0 benchmarks · Images, Texts
Page 121 of 158