3,148 machine learning datasets
word2word contains easy-to-use word translations for 3,564 language pairs.
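A minimal usage sketch, assuming the PyPI `word2word` package and its `Word2word` class (the bilingual lexicon for a pair is downloaded on first use); the import is guarded so the snippet degrades gracefully when the package is unavailable:

```python
# Hedged sketch of querying word2word for a language pair; assumes the
# PyPI `word2word` package. Guarded so the snippet still runs when the
# package (or network access) is unavailable.
try:
    from word2word import Word2word

    en2fr = Word2word("en", "fr")  # fetches the en-fr lexicon on first use
    print(en2fr("apple"))          # top French translation candidates
except ImportError:
    print("word2word is not installed; try `pip install word2word`")
```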
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.
The Headlines dataset for sarcasm detection is collected from two news websites. The Onion produces satirical versions of current events; the dataset includes all of its headlines from the News in Brief and News in Photos categories (which are sarcastic), together with real (non-sarcastic) news headlines from HuffPost. This dataset has the following advantages over existing Twitter datasets:
HJDataset is a large dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts.
Dataset for lyrics alignment and transcription evaluation. It contains 20 music pieces under CC license from the Jamendo website along with their lyrics, with:
The MLQE dataset is a dataset for sentence-level Machine Translation Quality Estimation. It consists of 6 language pairs representing neural machine translation (NMT) training in high-, medium-, and low-resource scenarios. The corpus is extracted from Wikipedia, and 10K segments per language pair are annotated.
PerSenT is a dataset of crowd-sourced annotations of the sentiment expressed by the authors towards the main entities in news articles. The dataset also includes paragraph-level sentiment annotations to provide more fine-grained supervision for the task.
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning 12 domains, making it larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35% of the dataset, respectively. Each dialogue is labelled with comprehensive annotations, including the dialogue goal in the form of a natural language description, domain, and dialogue states and acts on both the user and system sides. In addition to these traditional dialogue annotations, it also includes linguistic annotations of discourse phenomena in dialogues, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks.
A large-scale evaluation set that provides human ratings for the plausibility of 10,000 selectional preference (SP) pairs over five SP relations, covering the 2,500 most frequent verbs, nouns, and adjectives in American English.
The Taskmaster-2 dataset consists of 17,289 dialogs in seven domains: restaurants (3276), food ordering (1050), movies (3047), hotels (2355), flights (2481), music (1602), and sports (3478).
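As a quick sanity check, the per-domain dialog counts listed above do sum to the stated total:

```python
# Per-domain dialog counts as listed for Taskmaster-2.
domains = {
    "restaurants": 3276,
    "food ordering": 1050,
    "movies": 3047,
    "hotels": 2355,
    "flights": 2481,
    "music": 1602,
    "sports": 3478,
}
total = sum(domains.values())
print(total)  # 17289
```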
A collection of 2,511 recipes for zero-shot learning, recognition, and anticipation.
Twitter100k is a large-scale dataset for weakly supervised cross-media retrieval.
The Sarcasm Corpus contains sarcastic and non-sarcastic utterances of three different types, balanced so that half of the samples are sarcastic and half non-sarcastic. The three types are:
IG-1B-Targeted is an internal Facebook AI Research dataset consisting of 940 million public images tagged with 1.5K hashtags that match 1,000 ImageNet-1K synsets.
This dataset consists of images and annotations in Bengali. The images are human-annotated in Bengali by two adult native Bengali speakers. All popular image captioning datasets have a predominant Western cultural bias, with annotations written in English. Using such datasets to train an image captioning system assumes both that a good English-to-target-language translation system exists and that the original dataset contains elements of the target culture. Both assumptions are false, leading to the need for a culturally relevant dataset in Bengali, to generate appropriate captions for images relevant to the Bangladeshi and wider subcontinental context. The dataset presented consists of 9,154 images.
GAD, or Gene Associations Database, is a corpus of gene-disease associations curated from genetic association studies.
CSFCube is an expert-annotated test collection to evaluate models trained to perform faceted Query by Example. This test collection consists of a diverse set of 50 query documents, drawn from computational linguistics and machine learning venues.
This dataset arises from the READ project (Horizon 2020).
Twitter-MEL is a multimodal entity linking (MEL) dataset built from Twitter. It consists of tweets containing both text and images, with a total of 2.6M timeline tweets and 20K entities.