On arrival at each state, each observation token gets a coin toss to decide whether it appears in the output observation string. Numbers on the left are observation indices; numbers on the right are state indices.
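As a rough illustration, here is a minimal Python sketch of that coin-toss emission step; the function name, keep-probability, and token sets are all hypothetical, not taken from the dataset.

```python
import random

def emit_observations(state_tokens, p=0.5, rng=random):
    # Keep each observation token with probability p (the "coin toss").
    # p=0.5 is an assumed fair coin; the dataset's actual probability
    # is not specified here.
    return [tok for tok in state_tokens if rng.random() < p]

# Hypothetical walk over three states, each with candidate observation tokens.
states = [["a", "b", "c"], ["d", "e"], ["f", "g", "h"]]
print([" ".join(emit_observations(toks)) for toks in states])
# e.g. ['a c', 'e', 'f g h']
```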
The data comprises about 1.5 million English tweets annotated for part-of-speech using Ritter's extension of the PTB tagset. The tweets are from 2012 and 2013, tokenized with the GATE tokenizer and tagged jointly by the CMU ARK tagger and Ritter's T-POS tagger. A tweet is added to the dataset only when both taggers' outputs are fully compatible over the whole tweet.
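A minimal sketch of that agreement filter, assuming compatibility means exact token-for-token equality of the two taggers' outputs (the dataset's actual compatibility rules may be looser, e.g. allowing a mapping between tagsets):

```python
def taggers_compatible(tags_a, tags_b):
    # Compatible only if both outputs match token-for-token over the whole tweet.
    return len(tags_a) == len(tags_b) and all(a == b for a, b in zip(tags_a, tags_b))

# Hypothetical per-token tags from the two taggers for two tweets.
ark_tags  = [["NNP", "VBZ", "JJ"], ["PRP", "VBP"]]
tpos_tags = [["NNP", "VBZ", "JJ"], ["PRP", "VBD"]]
kept = [i for i, (a, b) in enumerate(zip(ark_tags, tpos_tags))
        if taggers_compatible(a, b)]
print(kept)  # [0]: only the first tweet survives the agreement filter
```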
PTB-tagged English Tweets
This is a stance detection dataset in the Zulu language. The data was translated into Zulu from English source texts by native Zulu speakers.
Automatic language identification is a challenging problem, and discriminating between closely related languages is especially difficult. This paper presents a machine-learning approach to automatic language identification for the Nordic languages, which often suffer miscategorization by existing state-of-the-art tools. Concretely, we focus on discriminating between six Nordic languages: Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokmål), Faroese, and Icelandic. This is the data for the task. Two variants are provided, 10K and 50K, containing 10,000 and 50,000 examples per language, respectively.
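The paper's actual model is not reproduced here, but a common baseline for discriminating closely related languages is a character n-gram classifier; a minimal sketch with assumed toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training examples; the real variants hold 10,000 or 50,000 per language.
texts  = ["jeg hedder Anna", "jag heter Anna", "eg heiti Anna", "ég heiti Anna"]
labels = ["da", "sv", "fo", "is"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["jag heter Erik"]))  # likely ['sv']
```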
This dataset is parallel text for Bornholmsk and Danish.
This is a high-quality dataset of posts sampled from social media and annotated for misogyny. The language is Danish.
This is an abusive/offensive language detection dataset for Albanian, built from Instagram and YouTube comments. The data is formatted following the OffensEval convention.
Political stance in Danish. Examples are statements by politicians, annotated as for, against, or neutral toward a given topic/article.
A collection of diacritized Hebrew text in a variety of registers and from different sources.
We scraped the Gutenberg Project and a subset of English Wikipedia to obtain the list of sentences that contain "any". Next, using a combination of heuristics, we filtered the result with regular expressions (illustrative patterns are sketched after this list) to produce two sets of sentences, the second of which underwent additional manual filtering:
* 3,844 sentences with sentential negation and a plural object with "any" to the right of the verb;
* 330 sentences with "nobody" / "no one" as subject and a plural object with "any" to the right.
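The regular expressions below are simplified stand-ins for the two patterns, not the paper's actual heuristics; in particular, reliably matching a "plural object" would need POS information rather than a trailing "s".

```python
import re

# Simplified illustrative patterns for the two sentence sets.
neg_any    = re.compile(r"\b(?:not|never)\b.*\bany\b\s+\w+s\b", re.I)
nobody_any = re.compile(r"\b(?:nobody|no one)\b.*\bany\b\s+\w+s\b", re.I)

sents = [
    "He did not buy any books.",
    "Nobody brought any snacks.",
    "She bought some books.",
]
print([s for s in sents if neg_any.search(s)])     # ['He did not buy any books.']
print([s for s in sents if nobody_any.search(s)])  # ['Nobody brought any snacks.']
```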
We used the following procedure. First, we automatically identified the set of verbs and nouns to build our items from. To do so, we started with the bert-base-uncased vocabulary and ran all non-subword lexical tokens through a spaCy POS tagger. We then lemmatized the result using https://pypi.org/project/Pattern/ and dropped duplicates. Next, we filtered out modal verbs, singularia tantum nouns, and some visible lemmatization mistakes. Finally, we filtered out non-transitive verbs to give the dataset a somewhat higher baseline of grammaticality.
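A rough sketch of the vocabulary step, assuming the transformers library and an installed spaCy en_core_web_sm model; it simplifies the procedure (lemmatizing with spaCy rather than Pattern, and omitting the modal, singularia tantum, and transitivity filters):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
vocab = AutoTokenizer.from_pretrained("bert-base-uncased").get_vocab()

# Keep non-subword, purely alphabetic entries ("##ing" and the like are dropped).
words = [w for w in vocab if w.isalpha()]

verbs, nouns = set(), set()
for doc in nlp.pipe(words, batch_size=1000):
    tok = doc[0]  # each vocabulary entry is a single word
    if tok.pos_ == "VERB":
        verbs.add(tok.lemma_)  # the paper used Pattern for lemmatization
    elif tok.pos_ == "NOUN":
        nouns.add(tok.lemma_)

print(len(verbs), len(nouns))
```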
Minecraft Corpus dataset with builder utterance annotations
The task addresses the problem of the appearance and propagation of posts that share misleading multimedia content (images or video). In the context of the task, different types of misleading use are considered:
A dataset of 8,551 ban evasion pairs on Wikipedia, where each pair comprises a parent account and a child account. We adopt a strategy that ensures a 1:1 mapping between parent and child accounts. For each account in these ban evasion pairs, we provide the following data (a hypothetical record shape is sketched after this list):
- Wikipedia username, creation date, ban date, and other account-level metadata
- Corresponding edit information in the form of revision IDs, pages edited, added text, deleted text, edit comments, and timestamps
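A hypothetical shape for one ban evasion pair; the field names and values are illustrative assumptions, not the dataset's actual schema:

```python
pair = {
    "parent": {
        "username": "ExampleParent",
        "creation_date": "2014-03-01",
        "ban_date": "2015-07-12",
        "edits": [{
            "revision_id": 123456789,
            "page": "Some_Article",
            "added_text": "new sentence ...",
            "deleted_text": "",
            "comment": "copyedit",
            "timestamp": "2015-06-30T12:00:00Z",
        }],
    },
    "child": {
        # Same structure as "parent"; created after the parent was banned.
        "username": "ExampleChild",
        "creation_date": "2015-07-20",
        "ban_date": "2015-09-01",
        "edits": [],
    },
}
```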
It contains data from two different sources: Food.com, a well-known American recipe site, and Planeat, an Italian site for planning recipes so as to reduce food waste. The dataset is divided into two parts: embeddings, which can be used directly to run the system and receive suggestions, and raw data, which must first be processed into embeddings.
OAGL is a paper topic dataset consisting of 6,942,930 records comprising various scientific publication attributes such as abstracts, titles, keywords, publication years, venues, etc. The last two fields of each record are the topic ID, drawn from a taxonomy of 27 topics created from the entire collection, and the 20 most significant topic words. Each record (sample) is stored as a JSON line in the text file.
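A hypothetical JSON line with assumed field names (the actual OAGL schema may differ), showing the two trailing topic fields:

```python
import json

record = {
    "title": "An Example Paper Title",
    "abstract": "We study ...",
    "keywords": ["example", "topics"],
    "year": 2019,
    "venue": "Example Venue",
    "topic_id": 12,                               # one of the 27 taxonomy topics
    "topic_words": ["model", "data", "learning"],  # 20 words in the real data
}
print(json.dumps(record))  # one record per line in the text file
```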
French sentences are sourced from the Tatoeba repository and then translated into Congolese Swahili.
This is the BIG-bench version of our language-based movie recommendation dataset.
It includes 10 datasets, each provided as both a raw dataset and an encoded dataset, where encoding is performed by the BERT-Sort encoder with MLM initialization.