3,148 machine learning datasets
word2word contains easy-to-use word translations for 3,564 language pairs.
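A minimal usage sketch, assuming the PyPI `word2word` package and its `Word2word` class (the bilingual lexicon for a pair is downloaded on first use); the import is guarded so the snippet degrades gracefully when the package is unavailable:

```python
# Hedged sketch of querying word2word for a language pair; assumes the
# PyPI `word2word` package. Guarded so the snippet still runs when the
# package (or network access) is unavailable.
try:
    from word2word import Word2word

    en2fr = Word2word("en", "fr")  # fetches the en-fr lexicon on first use
    print(en2fr("apple"))          # top French translation candidates
except ImportError:
    print("word2word is not installed; try `pip install word2word`")
```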
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
The Headlines dataset for sarcasm detection is collected from two news websites. The Onion produces satirical versions of current events; the dataset includes all of its headlines from the News in Brief and News in Photos categories (which are sarcastic), together with real (non-sarcastic) news headlines from HuffPost. This dataset has the following advantages over existing Twitter datasets:
HJDataset is a large dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts.
Dataset for lyrics alignment and transcription evaluation. It contains 20 music pieces under CC license from the Jamendo website along with their lyrics, with:
The MLQE dataset is a dataset for sentence-level Machine Translation Quality Estimation. It consists of 6 language pairs representing neural machine translation (NMT) training in high-, medium-, and low-resource scenarios. The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.
PerSenT is a dataset of crowd-sourced annotations of the sentiment expressed by the authors towards the main entities in news articles. The dataset also includes paragraph-level sentiment annotations to provide more fine-grained supervision for the task.
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning 12 domains, making it larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35% of the dataset, respectively. Each dialogue is labelled with comprehensive annotations, including the dialogue goal in the form of a natural language description, domain, and dialogue states and acts on both the user and system sides. In addition to these traditional dialogue annotations, it also includes linguistic annotations of discourse phenomena in dialogues, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks.
A large-scale evaluation set that provides human ratings for the plausibility of 10,000 selectional preference (SP) pairs over five SP relations, covering the 2,500 most frequent verbs, nouns, and adjectives in American English.
The Taskmaster-2 dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478).
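As a quick sanity check, the per-domain dialog counts listed above do sum to the stated total:

```python
# Per-domain dialog counts as listed for Taskmaster-2.
domains = {
    "restaurants": 3276,
    "food ordering": 1050,
    "movies": 3047,
    "hotels": 2355,
    "flights": 2481,
    "music": 1602,
    "sports": 3478,
}
total = sum(domains.values())
print(total)  # 17289
```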
A collection of 2,511 recipes for zero-shot learning, recognition, and anticipation.
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
The Sarcasm Corpus contains sarcastic and non-sarcastic utterances of three different types, balanced so that half of the samples are sarcastic and half non-sarcastic. The three types are:
IG-1B-Targeted is an internal Facebook AI Research dataset consisting of 940 million public images tagged with 1.5K hashtags that match 1,000 ImageNet-1K synsets.
This dataset consists of images and annotations in Bengali. The images are human-annotated in Bengali by two adult native Bengali speakers. All popular image captioning datasets have a predominant Western cultural bias, with annotations written in English. Using such datasets to train an image captioning system assumes both that a good English-to-target-language translation system exists and that the original dataset contains elements of the target culture. Both assumptions are false, leading to the need for a culturally relevant dataset in Bengali, to generate appropriate captions for images relevant to the Bangladeshi and wider subcontinental context. The dataset presented consists of 9,154 images.
GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.
CSFCube is an expert-annotated test collection to evaluate models trained to perform faceted Query by Example. This test collection consists of a diverse set of 50 query documents, drawn from computational linguistics and machine learning venues.
This dataset arises from the READ project (Horizon 2020).
Twitter-MEL is a multimodal entity linking (MEL) dataset built from Twitter. It consists of tweets containing both text and images, with a total of 2.6M timeline tweets and 20K entities.