Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

3,148 dataset results

HLA-Chat

HLA-Chat models character profiles and gives dialogue agents the ability to learn characters' language styles through their Human Level Attributes (HLAs).

4 papers · 0 benchmarks · Texts

irc-disentanglement

This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 messages of IRC.

4 papers · 5 benchmarks · Texts

IS-A

The IS-A dataset contains relations extracted from a medical ontology, in which entities are linked by the “is a” relation. For example, “acute leukemia” is a “leukemia”. The dataset has 294,693 nodes with 356,541 edges between them.

4 papers · 0 benchmarks · Graphs, Medical, Texts

pn-summary

Pn-summary is a dataset for Persian abstractive text summarization.

4 papers · 0 benchmarks · Texts

Quda

Quda aims to help visualization-oriented natural language interfaces (V-NLIs) recognize analytic tasks from free-form natural language by supporting the training and evaluation of multi-label classification models. The dataset contains diverse user queries, each annotated with one or more analytic tasks.

4 papers · 0 benchmarks · Texts

Query-Focused Video Summarization Dataset

A dataset of dense per-video-shot concept annotations.

4 papers · 2 benchmarks · Texts, Videos

RONEC (Romanian Named Entity Corpus)

The Romanian Named Entity Corpus (RONEC) is a named entity corpus for the Romanian language. It contains over 26,000 entities in ~5,000 annotated sentences, belonging to 16 distinct classes. The sentences were extracted from a copyright-free newspaper and cover several styles. The corpus is the first initiative in the Romanian language space specifically targeted at named entity recognition.

4 papers · 0 benchmarks · Texts

ViMMRC (Vietnamese Multiple-choice Machine Reading Comprehension Corpus)

A challenging machine comprehension corpus with multiple-choice questions, intended for research on the machine comprehension of Vietnamese text. This corpus includes 2,783 multiple-choice questions and answers based on a set of 417 Vietnamese texts used for teaching reading comprehension for 1st to 5th graders. Answers may be extracted from the contents of single or multiple sentences in the corresponding reading text.

4 papers · 0 benchmarks · Texts

WikiSem500

The WikiSem500 dataset contains around 500 per-language cluster groups for English, Spanish, German, Chinese, and Japanese (a total of 13,314 test cases).

4 papers · 0 benchmarks · Texts

XOR-TYDI QA

A large-scale dataset built on questions from TyDi QA lacking same-language answers.

4 papers · 0 benchmarks · Texts

X-SRL

SRL is the task of extracting semantic predicate-argument structures from sentences. X-SRL is a multilingual parallel Semantic Role Labelling (SRL) corpus for English (EN), German (DE), French (FR) and Spanish (ES) that is based on English gold annotations and shares the same labelling scheme across languages.

4 papers · 0 benchmarks · Texts

CC-DBP

CC-DBP is a dataset for knowledge base population research using Common Crawl and DBpedia.

4 papers · 0 benchmarks · Texts

CCPE-M (Coached Conversational Preference Elicitation dataset for Movies)

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.

4 papers · 0 benchmarks · Texts

MuMu

MuMu is a new dataset of more than 31k albums classified into 250 genre classes.

4 papers · 0 benchmarks · Audio, Images, Texts

ARC-DA (ARC Direct Answer Questions)

The ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.

4 papers · 0 benchmarks · Texts

Advising Corpus

Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised which classes to take. These were collected at the University of Michigan with IRB approval. They were released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.

4 papers · 3 benchmarks · Texts

Finnish Paraphrase Corpus

Finnish Paraphrase Corpus is a fully manually annotated paraphrase corpus for Finnish containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Of all paraphrase pairs in the corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts.

4 papers · 0 benchmarks · Texts

WEC-Eng

WEC-Eng is a cross-document event coreference resolution dataset extracted from English Wikipedia. Coreference links are not restricted to predefined topics. The training set includes 40,529 mentions distributed into 7,042 coreference clusters.

4 papers · 0 benchmarks · Texts

ATIS (vi) (Vietnamese Intent Detection and Slot Filling)

This is a dataset for intent detection and slot filling for the Vietnamese language. The dataset consists of 5,871 gold-annotated utterances with 28 intent labels and 82 slot types.

4 papers · 2 benchmarks · Texts

MS^2 (Multi-Document Summarization of Medical Studies)

MS^2 (Multi-Document Summarization of Medical Studies) is a dataset of over 470k documents and 20k summaries derived from the scientific literature. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.

4 papers · 2 benchmarks · Texts