Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2


BoostCLIR

BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts extracted from the MAREC patent data and the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search.

2 papers · 0 benchmarks · Texts

DeCOCO

DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset and the German part consists of translations by a native German speaker.

2 papers · 0 benchmarks · Texts

Large-Scale CLIR Dataset

The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR). The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.

2 papers · 0 benchmarks · Texts

SciGen

SciGen is a challenge dataset for the task of reasoning-aware data-to-text generation consisting of tables from scientific articles and their corresponding descriptions. The unique properties of SciGen are that (1) tables mostly contain numerical values, and (2) the corresponding descriptions require arithmetic reasoning. SciGen is therefore the first dataset that assesses the arithmetic reasoning capabilities of generation models on complex input structures, i.e., tables from scientific articles. SciGen opens new avenues for future research in reasoning-aware text generation and evaluation.

2 papers · 0 benchmarks · Images, Texts

WikiCaps

WikiCaps is a large-scale multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the widely available non-descriptive image-caption pairs on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and an additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.

2 papers · 0 benchmarks · Texts

Hateful Users on Twitter

This is a Twitter dataset of 100,386 users, along with up to 200 tweets from their timelines, collected with a random-walk-based crawler on the retweet graph. A subsample of 4,972 users was manually annotated as hateful or not through crowdsourcing. The dataset can be used to examine differences between hateful and normal users in activity patterns, disseminated content, and network centrality measurements in the sampled graph.

2 papers · 0 benchmarks · Graphs, Texts

robo-vln (Robotics Vision-and-Language Navigation)

The Robo-VLN dataset is a continuous control formulation of the VLN-CE dataset by Krantz et al., which was ported over from the Room-to-Room (R2R) dataset created by Anderson et al. Details on converting the discrete VLN dataset into a continuous control formulation can be found in the accompanying paper.

2 papers · 1 benchmark · Images, RGB-D, Texts, Time series

Signal-1M

The Signal Media One-Million News Articles Dataset by Signal Media was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.

2 papers · 0 benchmarks · Texts

Comparative Question Completion

Comparative Question Completion is a dataset for evaluating what large language models learn.

2 papers · 0 benchmarks · Texts

AM2iCo (Adversarial and Multilingual Meaning in Context)

AM2iCo is a wide-coverage and carefully designed cross-lingual and multilingual evaluation set. It aims to assess the ability of state-of-the-art representation models to reason over cross-lingual lexical-level concept alignment in context for 14 language pairs.

2 papers · 0 benchmarks · Texts

GermanDPR

GermanDPR is a dataset for passage retrieval in German. GermanDPR comprises 8,245 question/answer pairs in the training set, 1,030 pairs in the development set, and 1,025 pairs in the test set. For each pair, there is one positive context and three hard negative contexts.

2 papers · 0 benchmarks · Texts

Weibo-COV

Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset's fields are rich, including basic post information, interaction information, location information, and the retweet network.

2 papers · 0 benchmarks · Texts

EDNA-Covid

EDNA-Covid is a multilingual, large-scale dataset of coronavirus-related tweets collected since January 25, 2020. EDNA-Covid includes, at the time of this publication, over 600M tweets from around the world in over 10 languages.

2 papers · 0 benchmarks · Texts

UPFD-POL (User Preference-aware Fake News Detection)

The PolitiFact variant of the UPFD dataset for benchmarking.

2 papers · 2 benchmarks · Graphs, Texts

ExpMRC

ExpMRC is a benchmark for evaluating the explainability of Machine Reading Comprehension (MRC). ExpMRC contains four subsets of popular MRC datasets with additionally annotated evidence, including SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice MRC tasks in both English and Chinese.

2 papers · 0 benchmarks · Texts

HLGD (Headline Grouping Dataset)

The Headline Grouping dataset is a binary classification dataset on pairs of news headlines. For each pair of headlines, the binary label indicates whether the two headlines are part of the same group (and describe the same underlying event) or whether they are in distinct groups. The dataset contains a total of 20k annotated headline pairs, further split into train, validation, and test portions.

2 papers · 0 benchmarks · Texts

AraCOVID19-MFH (Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset)

AraCOVID19-MFH is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. The dataset contains 10,828 Arabic tweets annotated with 10 different labels.

2 papers · 0 benchmarks · Texts

R2VQ (Recipe-to-Video Questions)

R2VQ is a dataset designed for testing competence-based comprehension of machines over a multimodal recipe collection, which contains text-video aligned recipes.

2 papers · 0 benchmarks · Texts, Videos

Essay-BR

This repository contains essays written by Brazilian high school students. These essays were graded by human professionals following the criteria of the ENEM exam.

2 papers · 0 benchmarks · Texts

Rare Diseases Mentions in MIMIC-III (Rare disease mention annotations from a sample of MIMIC-III clinical notes)

Data annotation: the 1,073 full rare disease mention annotations (from 312 MIMIC-III discharge summaries) are in full_set_RD_ann_MIMIC_III_disch.csv.

2 papers · 1 benchmark · Texts