Datasets

3,148 machine learning datasets

BoostCLIR

BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts extracted from the MAREC patent data and the NTCIR PatentMT workshop collections, accompanied by relevance judgements for the task of patent prior-art search.

2 papers | 0 benchmarks | Texts

DeCOCO

DeCOCO is a bilingual (English-German) corpus of image descriptions, where the English part is extracted from the COCO dataset and the German part consists of translations by a native German speaker.

2 papers | 0 benchmarks | Texts

Large-Scale CLIR Dataset

The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR). The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.

2 papers | 0 benchmarks | Texts

SciGen

SciGen is a challenge dataset for the task of reasoning-aware data-to-text generation consisting of tables from scientific articles and their corresponding descriptions. The unique properties of SciGen are that (1) tables mostly contain numerical values, and (2) the corresponding descriptions require arithmetic reasoning. SciGen is therefore the first dataset that assesses the arithmetic reasoning capabilities of generation models on complex input structures, i.e., tables from scientific articles. SciGen opens new avenues for future research in reasoning-aware text generation and evaluation.

2 papers | 0 benchmarks | Images, Texts

WikiCaps

WikiCaps is a large-scale multilingual but non-parallel dataset for multimodal machine translation and retrieval. The image-caption data was extracted from Wikimedia Commons and is thus representative of the widely available non-descriptive image-caption pairs on the web. The current version of the dataset contains 3,816,940 images with 3,825,132 English captions and an additional 1,000 image-caption pairs in German, French, and Russian together with their English counterparts.

2 papers | 0 benchmarks | Texts

Hateful Users on Twitter

This is a Twitter dataset of 100,386 users along with up to 200 tweets from each of their timelines, collected with a random-walk-based crawler on the retweet graph. A subsample of 4,972 users is manually annotated as hateful or not through crowdsourcing. The dataset can be used to examine differences between hateful and normal users in activity patterns, disseminated content, and network centrality measurements in the sampled graph.

2 papers | 0 benchmarks | Graphs, Texts

robo-vln (Robotics Vision-and-Language Navigation)

The Robo-VLN dataset is a continuous control formulation of the VLN-CE dataset by Krantz et al., which was ported over from the Room-to-Room (R2R) dataset created by Anderson et al. Details on converting the discrete VLN dataset into a continuous control formulation can be found in the accompanying paper.

2 papers | 1 benchmark | Images, RGB-D, Texts, Time series

Signal-1M

The Signal Media One-Million News Articles Dataset by Signal Media was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.

2 papers | 0 benchmarks | Texts

Comparative Question Completion

Comparative Question Completion is a dataset for evaluating what large language models learn.

2 papers | 0 benchmarks | Texts

AM2iCo (Adversarial and Multilingual Meaning in Context)

AM2iCo is a wide-coverage and carefully designed cross-lingual and multilingual evaluation set. It aims to assess the ability of state-of-the-art representation models to reason over cross-lingual lexical-level concept alignment in context for 14 language pairs.

2 papers | 0 benchmarks | Texts

GermanDPR

GermanDPR is a dataset for passage retrieval in German. It comprises 8,245 question/answer pairs in the training set, 1,030 pairs in the development set, and 1,025 pairs in the test set. Each pair comes with one positive context and three hard negative contexts.

2 papers | 0 benchmarks | Texts

Weibo-COV

Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset's fields are rich, including basic post information, interaction information, location information, and the retweet network.

2 papers | 0 benchmarks | Texts

EDNA-Covid

EDNA-Covid is a multilingual, large-scale dataset of coronavirus-related tweets collected since January 25, 2020. At the time of this publication, EDNA-Covid includes over 600M tweets from around the world in over 10 languages.

2 papers | 0 benchmarks | Texts

UPFD-POL (User Preference-aware Fake News Detection)

The PolitiFact variant of the UPFD dataset for benchmarking.

2 papers | 2 benchmarks | Graphs, Texts

ExpMRC

ExpMRC is a benchmark for evaluating the explainability of Machine Reading Comprehension (MRC) systems. ExpMRC contains four subsets of popular MRC datasets with additionally annotated evidence, including SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice MRC tasks in both English and Chinese.

2 papers | 0 benchmarks | Texts

HLGD (Headline Grouping Dataset)

The Headline Grouping dataset is a binary classification dataset on pairs of news headlines. For each pair of headlines, the binary label indicates whether the two headlines are part of the same group (and describe the same underlying event) or whether they are in distinct groups. The dataset contains a total of 20k annotated headline pairs, further split into train, validation, and test portions.

2 papers | 0 benchmarks | Texts

