Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

Lazaro Corpus

A corpus of 21,570 newspaper headlines written in European Spanish, annotated with emergent anglicisms.

1 paper · 0 benchmarks · Texts

Lenta Short Sentences

The Lenta Short Sentences dataset is a text dataset for language modelling for the Russian language. It consists of 236K sentences sampled from the Lenta News dataset.

1 paper · 0 benchmarks · Texts

LSICC (Large Scale Informal Chinese Corpus)

Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. It contains around 37 million book reviews and 50 thousand netizens' comments on the news.

1 paper · 0 benchmarks · Texts

Mafiascum

A collection of over 700 games of Mafia, in which players are randomly assigned either deceptive or non-deceptive roles and then interact via forum postings. Over 9,000 documents were compiled from the dataset, each containing all messages written by a single player in a single game. This corpus was used to construct a set of hand-picked linguistic features based on prior deception research, as well as a set of average word vectors enriched with subword information.

1 paper · 0 benchmarks · Texts

MalayalamMixSentiment

MalayalamMixSentiment is a Sentiment Analysis Dataset for Code-Mixed Malayalam-English.

1 paper · 0 benchmarks · Texts

Marmara Turkish Coreference Resolution Corpus

The Marmara Turkish Coreference Corpus is an annotation of the whole METU-Sabanci Turkish Treebank with mentions and coreference chains.

1 paper · 0 benchmarks · Texts

Medical Case Report Corpus

Medical Case Report Corpus is a corpus comprising annotations of medical entities in case reports originating from PubMed Central's open access library.

1 paper · 0 benchmarks · Medical, Texts

medisim

medisim is a collection of new large-scale medical term similarity datasets based on SNOMED-CT.

1 paper · 0 benchmarks · Texts

Mega-COV

Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes back as far as 2007), multilingual (comes in 65 languages), and has a significant number of location-tagged tweets (~32M tweets).

1 paper · 0 benchmarks · Texts

MK-SQuIT

An example dataset of 110,000 question/query pairs across four WikiData domains.

1 paper · 0 benchmarks · Texts

Modern Hebrew Sentiment Dataset

Modern Hebrew Sentiment Dataset is a sentiment analysis benchmark for Hebrew, based on 12K social media comments and provided in two settings: token-based and morpheme-based.

1 paper · 0 benchmarks · Texts

MultiWOZ-coref

MultiWOZ-coref (or MultiWOZ 2.3) is an extension of the MultiWOZ dataset that adds co-reference annotations in addition to corrections of dialogue acts and dialogue states.

1 paper · 0 benchmarks · Texts

MWE-CWI

Multiword expressions (MWEs) represent lexemes that should be treated as single lexical units due to their idiosyncratic nature. MWE-CWI is a dataset for MWE detection based on the Complex Word Identification Shared Task 2018 dataset.

1 paper · 0 benchmarks · Texts

NLI-PT

The first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP.

1 paper · 0 benchmarks · Texts

Permuted bAbI dialog task

The Permuted bAbI dialog task is an adaptation of the "Dialog bAbI tasks data" dataset released by Facebook. It is used for evaluating end-to-end dialog systems in the restaurant domain. This dataset introduces multiple valid next utterances to the original bAbI dialog tasks, which allows evaluation of end-to-end goal-oriented dialog systems in a more realistic setting.

1 paper · 0 benchmarks · Texts

PheMT

PheMT is a phenomenon-wise dataset designed for evaluating the robustness of Japanese-English machine translation systems. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena common in UGC: Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant.

1 paper · 0 benchmarks · Texts

PMC-SA (PMC Structured Abstracts)

PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.

1 paper · 0 benchmarks · Texts

Pow-Wow

A dataset for studying situated goal-directed human communication.

1 paper · 0 benchmarks · Texts

Proto Summ

This is a large-scale court judgment dataset, in which each judgment summarizes the case description in a patternized style. It contains 2,003,390 court judgment documents. The case description is used as the input, and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words, respectively.

1 paper · 0 benchmarks · Texts

public_meetings

The public_meetings corpus contains 22 aligned meetings in total, each consisting of an automatic transcription of an audio recording paired with a meeting report written by a professional.

1 paper · 0 benchmarks · Texts
Page 105 of 158