TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images3,275
  • Texts3,148
  • Videos1,019
  • Audio486
  • Medical395
  • 3D383
  • Time series298
  • Graphs285
  • Tabular271
  • Speech199
  • RGB-D192
  • Environment148
  • Point cloud135
  • Biomedical123
  • LiDAR95
  • RGB Video87
  • Tracking78
  • Biology71
  • Actions68
  • 3d meshes65
  • Tables52
  • Music48
  • EEG45
  • Hyperspectral images45
  • Stereo44
  • MRI39
  • Physics32
  • Interactive29
  • Dialog25
  • Midi22
  • 6D17
  • Replay data11
  • Financial10
  • Ranking10
  • Cad9
  • fMRI7
  • Parallel6
  • Lyrics2
  • PSG2
Clear filter

3,148 dataset results

CIRCO (Composed Image Retrieval on Common Objects in context)

CIRCO (Composed Image Retrieval on Common Objects in context) is an open-domain benchmarking dataset for Composed Image Retrieval (CIR) based on real-world images from COCO 2017 unlabeled set. It is the first CIR dataset with multiple ground truths and aims to address the problem of false negatives in existing datasets. CIRCO comprises a total of 1020 queries, randomly divided into 220 and 800 for the validation and test set, respectively, with an average of 4.53 ground truths per query.

35 papers8 benchmarksImages, Texts

ARO

Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO-Order & Flickr30k-Order, to test for order sensitivity in VLMs. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases.

35 papers0 benchmarksImages, Texts

ProPara

The ProPara dataset is designed to train and test comprehension of simple paragraphs describing processes (e.g., photosynthesis), designed for the task of predicting, tracking, and answering questions about how entities change during the process.

34 papers0 benchmarksTexts

CoVoST

CoVoST is a large-scale multilingual speech-to-text translation corpus. Its latest 2nd version covers translations from 21 languages into English and from English into 15 languages. It has total 2880 hours of speech and is diversified with 78K speakers and 66 accents.

34 papers0 benchmarksAudio, Speech, Texts

GrailQA (Strongly Generalizable Question Answering)

GrailQA is a new large-scale, high-quality dataset for question answering on knowledge bases (KBQA) on Freebase with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.). It can be used to test three levels of generalization in KBQA: i.i.d., compositional, and zero-shot.

34 papers9 benchmarksTexts

xP3

xP3 is a multilingual dataset for multitask prompted finetuning. It is a composite of supervised datasets in 46 languages with English and machine-translated prompts.

34 papers0 benchmarksTexts

MTL-AQA

A new multitask action quality assessment (AQA) dataset, the largest to date, comprising of more than 1600 diving samples; contains detailed annotations for fine-grained action recognition, commentary generation, and estimating the AQA score. Videos from multiple angles provided wherever available.

33 papers12 benchmarksAudio, Texts, Videos

WMT 2015

WMT 2015 is a collection of datasets used in shared tasks of the Tenth Workshop on Statistical Machine Translation. The workshop featured five tasks:

33 papers0 benchmarksTexts

Dialogue State Tracking Challenge

The Dialog State Tracking Challenges 2 & 3 (DSTC2&3) were research challenge focused on improving the state of the art in tracking the state of spoken dialog systems. State tracking, sometimes called belief tracking, refers to accurately estimating the user's goal as a dialog progresses. Accurate state tracking is desirable because it provides robustness to errors in speech recognition, and helps reduce ambiguity inherent in language within a temporal process like dialog. In these challenges, participants were given labelled corpora of dialogs to develop state tracking algorithms. The trackers were then evaluated on a common set of held-out dialogs, which were released, un-labelled, during a one week period.

33 papers2 benchmarksDialog, Texts

SciREX

SCIREX is a document level IE dataset that encompasses multiple IE tasks, including salient entity identification and document level N-ary relation identification from scientific articles. The dataset is annotated by integrating automatic and human annotations, leveraging existing scientific knowledge resources.

33 papers1 benchmarksTexts

CCAligned

CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web-documents, and ensuring corresponding language codes were corresponding in the URLs of web documents. This pattern matching approach yielded more than 100 million aligned documents paired with English. Recognizing that each English document was often aligned to multiple documents in different target language, it is possible to join on English documents to obtain aligned documents that directly pair two non-English documents (e.g., Arabic-French).

33 papers0 benchmarksTexts

MeQSum

MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.

33 papers1 benchmarksMedical, Texts

GLUCOSE

GLUCOSE is a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context. To construct GLUCOSE, we drew on cognitive psychology to identify ten dimensions of causal explanation, focusing on events, states, motivations, and emotions. Each GLUCOSE entry includes a story-specific causal statement paired with an inference rule generalized from the statement.

33 papers0 benchmarksTexts

WikiEvents

WikiEvents is a document-level event extraction benchmark dataset which includes complete event and coreference annotation.

33 papers1 benchmarksTexts

WMT 2020

WMT 2020 is a collection of datasets used in shared tasks of the Fifth Conference on Machine Translation. The conference builds on a series of annual workshops and conferences on Statistical Machine Translation.

33 papers0 benchmarksTexts

SciBench

SciBench is a large-scale scientific problem-solving benchmark suite that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics.

33 papers0 benchmarksTexts

XSum

The Extreme Summarization (XSum) dataset is a dataset for evaluation of abstractive single-document summarization systems. The goal is to create a short, one-sentence new summary answering the question “What is the article about?”. The dataset consists of 226,711 news articles accompanied with a one-sentence summary. The articles are collected from BBC articles (2010 to 2017) and cover a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). The official random split contains 204,045 (90%), 11,332 (5%) and 11,334 (5) documents in training, validation and test sets, respectively.

32 papers2 benchmarksTexts

ImageNet-P

ImageNet-P consists of noise, blur, weather, and digital distortions. The dataset has validation perturbations; has difficulty levels; has CIFAR-10, Tiny ImageNet, ImageNet 64 × 64, standard, and Inception-sized editions; and has been designed for benchmarking not training networks. ImageNet-P departs from ImageNet-C by having perturbation sequences generated from each ImageNet validation image. Each sequence contains more than 30 frames, so to counteract an increase in dataset size and evaluation time only 10 common perturbations are used.

32 papers1 benchmarksTexts

MedQuAD (Medical Question Answering Dataset)

MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.

32 papers0 benchmarksMedical, Texts

PUBHEALTH

PUBHEALTH is a comprehensive dataset for explainable automated fact-checking of public health claims. Each instance in the PUBHEALTH dataset has an associated veracity label (true, false, unproven, mixture). Furthermore each instance in the dataset has an explanation text field. The explanation is a justification for which the claim has been assigned a particular veracity label.

32 papers0 benchmarksTexts
PreviousPage 24 of 158Next