On arrival at each state, each observation token gets a coin toss to decide whether it appears in the output observation string. Numbers on the left are observation indices; numbers on the right are state indices.
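As a rough illustration, here is a minimal Python sketch of that coin-toss emission step; the function name, keep-probability, and token sets are all hypothetical, not taken from the dataset.

```python
import random

def emit_observations(state_tokens, p=0.5, rng=random):
    # Keep each observation token with probability p (the "coin toss").
    # p=0.5 is an assumed fair coin; the dataset's actual probability
    # is not specified here.
    return [tok for tok in state_tokens if rng.random() < p]

# Hypothetical walk over three states, each with candidate observation tokens.
states = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
print([" ".join(emit_observations(toks)) for toks in states])
# e.g. ['a c', 'e', 'f g h']
```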
The data comprises about 1.5 million English tweets annotated for part-of-speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized with the GATE tokenizer and tagged jointly by the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when both taggers' outputs are fully compatible over the whole tweet.
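A minimal sketch of that agreement filter, assuming compatibility means exact token-for-token equality of the two taggers' outputs (the dataset's actual compatibility rules may be looser, e.g. allowing a mapping between tagsets):

```python
def taggers_compatible(tags_a, tags_b):
    # Compatible only if both outputs match token-for-token over the whole tweet.
    return len(tags_a) == len(tags_b) and all(a == b for a, b in zip(tags_a, tags_b))

# Hypothetical per-token tags from the two taggers for two tweets.
ark_tags  = [["NNP", "VBZ", "JJ"], ["PRP", "VBP"]]
tpos_tags = [["NNP", "VBZ", "JJ"], ["PRP", "VBD"]]
kept = [i for i, (a, b) in enumerate(zip(ark_tags, tpos_tags))
        if taggers_compatible(a, b)]
print(kept)  # [0]: only the first tweet survives the agreement filter
```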
PTB-tagged English Tweets
This is a stance detection dataset in the Zulu language. The data was translated into Zulu from English source texts by native Zulu speakers.
Automatic language identification is a challenging problem, and discriminating between closely related languages is especially difficult. This paper presents a machine-learning approach to automatic language identification for the Nordic languages, which often suffer miscategorization by existing state-of-the-art tools. Concretely, we focus on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese, and Icelandic. This is the data for the task. Two variants are provided, 10K and 50K, containing 10,000 and 50,000 examples per language, respectively.
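The paper's actual model is not reproduced here, but a common baseline for discriminating closely related languages is a character n-gram classifier; a minimal sketch with assumed toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training examples; the real variants hold 10,000 or 50,000 per language.
texts  = ["jeg hedder Anna", "jag heter Anna", "eg heiti Anna", "ég heiti Anna"]
labels = ["da", "sv", "fo", "is"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["jag heter Erik"]))  # likely ['sv']
```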
This dataset is parallel text for Bornholmsk and Danish.
This is a high-quality dataset of posts sampled from social media and annotated for misogyny. The language is Danish.
This is an abusive/offensive language detection dataset for Albanian, built from Instagram and YouTube comments. The data is formatted following the OffensEval convention.
Political stance in Danish. Examples are statements by politicians, annotated as for, against, or neutral toward a given topic/article.
A collection of diacritized Hebrew text in a variety of registers and from different sources.
We scraped the Gutenberg Project and a subset of English Wikipedia to obtain the list of sentences that contain "any". Next, using a combination of heuristics, we filtered the result with regular expressions (illustrative patterns are sketched after this list) to produce two sets of sentences, the second of which underwent additional manual filtering:
* 3,844 sentences with sentential negation and a plural object with "any" to the right of the verb;
* 330 sentences with "nobody" / "no one" as subject and a plural object with "any" to the right.
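The regular expressions below are simplified stand-ins for the two patterns, not the paper's actual heuristics; in particular, reliably matching a "plural object" would need POS information rather than a trailing "s".

```python
import re

# Simplified illustrative patterns for the two sentence sets.
neg_any    = re.compile(r"\b(?:not|never)\b.*\bany\b\s+\w+s\b", re.I)
nobody_any = re.compile(r"\b(?:nobody|no one)\b.*\bany\b\s+\w+s\b", re.I)

sents = [
    "He did not buy any books.",
    "Nobody brought any snacks.",
    "She bought some books.",
]
print([s for s in sents if neg_any.search(s)])     # ['He did not buy any books.']
print([s for s in sents if nobody_any.search(s)])  # ['Nobody brought any snacks.']
```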
We used the following procedure. First, we automatically identified the set of verbs and nouns to build our items from. To do so, we started with the bert-base-uncased vocabulary and ran all non-subword lexical tokens through a spaCy POS tagger. We then lemmatized the result using https://pypi.org/project/Pattern/ and dropped duplicates. Next, we filtered out modal verbs, singularia tantum nouns, and some visible lemmatization mistakes. Finally, we filtered out non-transitive verbs to give the dataset a somewhat higher baseline of grammaticality.
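A rough sketch of the vocabulary step, assuming the transformers library and an installed spaCy en_core_web_sm model; it simplifies the procedure (lemmatizing with spaCy rather than Pattern, and omitting the modal, singularia tantum, and transitivity filters):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
vocab = AutoTokenizer.from_pretrained("bert-base-uncased").get_vocab()

# Keep non-subword, purely alphabetic entries ("##ing" and the like are dropped).
words = [w for w in vocab if w.isalpha()]

verbs, nouns = set(), set()
for doc in nlp.pipe(words, batch_size=1000):
    tok = doc[0]  # each vocabulary entry is a single word
    if tok.pos_ == "VERB":
        verbs.add(tok.lemma_)  # the paper used Pattern for lemmatization
    elif tok.pos_ == "NOUN":
        nouns.add(tok.lemma_)

print(len(verbs), len(nouns))
```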
Minecraft Corpus dataset with builder utterance annotations
The task addresses the problem of the appearance and propagation of posts that share misleading multimedia content (images or video). In the context of the task, different types of misleading use are considered:
A dataset of 8,551 ban evasion pairs on Wikipedia, where each pair comprises a parent account and a child account. We adopt a strategy that ensures a 1:1 mapping between parent and child accounts. For each account in these ban evasion pairs, we provide the following data (a hypothetical record shape is sketched after this list):
- Wikipedia username, creation date, ban date, and other account-level metadata
- Corresponding edit information in the form of revision IDs, pages edited, added text, deleted text, edit comments, and timestamps
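A hypothetical shape for one ban evasion pair; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
pair = {
    "parent": {
        "username": "ExampleParent",
        "creation_date": "2014-03-01",
        "ban_date": "2015-07-12",
        "edits": [{
            "revision_id": 123456789,
            "page": "Some_Article",
            "added_text": "new sentence ...",
            "deleted_text": "",
            "comment": "copyedit",
            "timestamp": "2015-06-30T12:00:00Z",
        }],
    },
    "child": {
        # Same structure as "parent"; created after the parent was banned.
        "username": "ExampleChild",
        "creation_date": "2015-07-20",
        "ban_date": "2015-09-01",
        "edits": [],
    },
}
```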
It contains data from two different sources: Food.com, a well-known American recipe site, and Planeat, an Italian site for planning recipes so as to reduce food waste. The dataset is divided into two parts: embeddings, which can be used directly to run the system and receive suggestions, and raw data, which must first be processed into embeddings.
OAGL is a paper topic dataset consisting of 6,942,930 records comprising various scientific publication attributes such as abstracts, titles, keywords, publication years, venues, etc. The last two fields of each record are the topic ID, drawn from a taxonomy of 27 topics created from the entire collection, and the 20 most significant topic words. Each record (sample) is stored as a JSON line in the text file.
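A hypothetical JSON line with assumed field names (the actual OAGL schema may differ), showing the two trailing topic fields:

```python
import json

record = {
    "title": "An Example Paper Title",
    "abstract": "We study ...",
    "keywords": ["example", "topics"],
    "year": 2019,
    "venue": "Example Venue",
    "topic_id": 12,                               # one of the 27 taxonomy topics
    "topic_words": ["model", "data", "learning"],  # 20 words in the real data
}
print(json.dumps(record))  # one record per line in the text file
```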
French sentences are sourced from the Tatoeba repository and then translated into Congolese Swahili.
This is the BIG-bench version of our language-based movie recommendation dataset.
It includes 10 datasets, each provided as both a raw dataset and an encoded dataset, where encoding is performed by the BERT-Sort encoder with MLM initialization.