A corpus of 21,570 newspaper headlines written in European Spanish and annotated with emergent anglicisms.
The Lenta Short Sentences dataset is a Russian-language text dataset for language modelling. It consists of 236K sentences sampled from the Lenta News dataset.
Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. It contains around 37 million book reviews and 50 thousand netizens' comments on news articles.
A collection of over 700 games of Mafia, in which players are randomly assigned either deceptive or non-deceptive roles and then interact via forum postings. Over 9,000 documents were compiled from the dataset, each containing all messages written by a single player in a single game. This corpus was used to construct a set of hand-picked linguistic features based on prior deception research, as well as a set of average word vectors enriched with subword information.
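A minimal sketch (not the authors' code, and with hypothetical tokenized documents) of the second feature type described above: representing each per-player document as the average of subword-enriched word vectors, here using gensim's FastText implementation.

```python
import numpy as np
from gensim.models import FastText

# Hypothetical per-player documents, each a tokenized concatenation of one
# player's messages from a single game.
documents = [
    ["i", "claim", "town", "please", "trust", "me"],
    ["that", "post", "reads", "like", "a", "lie"],
]

# FastText enriches word vectors with character n-gram (subword) information;
# min_n/max_n set the n-gram range.
model = FastText(sentences=documents, vector_size=100,
                 min_n=3, max_n=6, min_count=1, epochs=10)

def doc_vector(tokens, model):
    """Represent a document as the average of its tokens' word vectors."""
    return np.mean([model.wv[t] for t in tokens], axis=0)

features = np.stack([doc_vector(d, model) for d in documents])
print(features.shape)  # (num_documents, 100)
```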
MalayalamMixSentiment is a sentiment analysis dataset for code-mixed Malayalam-English text.
The Marmara Turkish Coreference Corpus is an annotation of the whole METU-Sabanci Turkish Treebank with mentions and coreference chains.
The Medical Case Report Corpus comprises annotations of medical entities in case reports drawn from PubMed Central's open access library.
medisim is a collection of new large-scale medical term similarity datasets based on SNOMED-CT.
Mega-COV is a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes back as far as 2007), multilingual (comes in 65 languages), and has a significant number of location-tagged tweets (~32M tweets).
A dataset of 110,000 question/query pairs across four Wikidata domains.
The Modern Hebrew Sentiment Dataset is a sentiment analysis benchmark for Hebrew, based on 12K social media comments and provided in two settings: token-based and morpheme-based.
MultiWOZ-coref (also known as MultiWOZ 2.3) is an extension of the MultiWOZ dataset that adds co-reference annotations in addition to corrections of dialogue acts and dialogue states.
Multiword expressions (MWEs) are lexemes that should be treated as single lexical units due to their idiosyncratic nature. MWE-CWI is a dataset for MWE detection based on the Complex Word Identification Shared Task 2018 dataset.
The first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese who are native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student texts and four types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only for NLI but also for research on several topics in Second Language Acquisition and educational NLP.
The Permuted bAbI dialog task is an adaptation of the "Dialog bAbI tasks data" dataset released by Facebook, used for evaluating end-to-end dialog systems in the restaurant domain. It introduces multiple valid next utterances into the original bAbI dialog tasks, allowing end-to-end goal-oriented dialog systems to be evaluated in a more realistic setting.
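As a hedged illustration (the data layout and function names below are assumptions, not the dataset's official API), the key evaluation change enabled by the permuted tasks is that a predicted utterance counts as correct if it matches any of the valid continuations, rather than a single gold response:

```python
def per_response_accuracy(predictions, valid_next_utterances):
    """predictions: one system utterance per turn;
    valid_next_utterances: a set of acceptable utterances per turn."""
    correct = sum(pred in valid
                  for pred, valid in zip(predictions, valid_next_utterances))
    return correct / len(predictions)

# Hypothetical turn with two equally valid next utterances.
preds = ["what price range are you looking for?"]
valid = [{"what price range are you looking for?",
          "which part of town do you prefer?"}]
print(per_response_accuracy(preds, valid))  # 1.0
```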
PheMT is a phenomenon-wise dataset designed for evaluating the robustness of Japanese-English machine translation systems. The dataset is based on the MTNT dataset, with additional annotations of four linguistic phenomena common in user-generated content (UGC): Proper Noun, Abbreviated Noun, Colloquial Expression, and Variant.
PMC-SA (PMC Structured Abstracts) is a dataset of academic publications, used for the task of structured summarization.
A dataset for studying situated goal-directed human communication.
This is a large-scale court judgment dataset of 2,003,390 documents, in which each judgment summarizes the case description in a formulaic style. The case description serves as the input and the court judgment as the summary; their average lengths are 595.15 and 273.57 words, respectively.
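A minimal sketch, assuming a hypothetical JSON-lines layout with case_description and court_judgment fields (not a documented format), of how the documents could be read as input/summary pairs and how the average lengths quoted above would be computed:

```python
import json

def load_pairs(path):
    """Yield (input, summary) pairs from a JSON-lines file (field names assumed)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["case_description"], record["court_judgment"]

def average_lengths(pairs):
    """Average whitespace-token lengths of inputs and summaries."""
    n = src_total = tgt_total = 0
    for source, summary in pairs:
        src_total += len(source.split())
        tgt_total += len(summary.split())
        n += 1
    return src_total / n, tgt_total / n
```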
The public_meetings corpus contains meetings made of pairs of automatic transcriptions of audio recordings and meeting reports written by a professional. In total, 22 aligned meetings are provided.