Datasets

3,148 machine learning datasets


AW-OIE (All Words OpenIE)

All Words Open IE (AW-OIE) is an open information extraction dataset derived from the Question-Answer Meaning Representation (QAMR) dataset.

3 papers · 0 benchmarks · Texts

RETWEET

RETWEET is a dataset of tweets and the overall predominant sentiment of their replies.

3 papers · 4 benchmarks · Texts

ShARe/CLEF 2014: Task 2 Disorders

3 papers · 1 benchmark · Medical, Texts

Summarizing Source Code using a Neural Attention Model

Presents a new dataset of code snippets with short descriptions, created using data gathered from Stack Overflow, a popular programming help website. Since access is open and unrestricted, the content is inherently noisy (ungrammatical, non-parsable, lacking content).

3 papers · 0 benchmarks · Texts

PhoNER COVID19

PhoNER_COVID19 is a dataset for recognising COVID-19 related named entities in Vietnamese, consisting of 35K entities over 10K sentences. The authors defined 10 entity types with the aim of extracting key information related to COVID-19 patients, which is especially useful in downstream applications. In general, these entity types can be used not only in the context of the COVID-19 pandemic but also in future epidemics.

3 papers · 1 benchmark · Texts

FreSaDa

FreSaDa is a French satire dataset for cross-domain satire detection, which is composed of 11,570 articles from the news domain. The dataset samples have been split into training, validation and test, such that the training publication sources are distinct from the validation and test publication sources. This gives rise to a cross-domain (cross-source) satire detection task.

3 papers · 0 benchmarks · Texts
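
The cross-source split described in the FreSaDa entry above can be illustrated with a minimal sketch. It assumes each article is a dict with a hypothetical `source` field, and it holds out whole publication sources so that no training source reappears at evaluation time. The function name, field names, and the split fraction are assumptions made for illustration, not the procedure used by the dataset authors.

```python
import random


def split_by_source(articles, train_frac=0.5, seed=0):
    """Split articles so that training sources never appear in the held-out set.

    `articles` is a list of dicts with a hypothetical "source" key.
    Returns (train, held_out); held_out can be divided further into
    validation and test portions.
    """
    sources = sorted({a["source"] for a in articles})
    rng = random.Random(seed)
    rng.shuffle(sources)
    # Assign whole sources, not individual articles, to the training side.
    n_train = max(1, int(len(sources) * train_frac))
    train_sources = set(sources[:n_train])
    train = [a for a in articles if a["source"] in train_sources]
    held_out = [a for a in articles if a["source"] not in train_sources]
    return train, held_out


# Toy data with made-up source names.
toy_articles = [
    {"source": "site_a", "text": "article 1"},
    {"source": "site_a", "text": "article 2"},
    {"source": "site_b", "text": "article 3"},
    {"source": "site_c", "text": "article 4"},
    {"source": "site_d", "text": "article 5"},
]
train, held_out = split_by_source(toy_articles)
```

Splitting by source rather than by article is what makes the task cross-domain: a model cannot score well simply by memorising the style of a publication it saw during training.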

FixMyPose

FixMyPose is a dataset for automated pose correction. It consists of descriptions to correct a "current" pose to look like a "target" pose, in English and Hindi. The collected descriptions have interesting linguistic properties such as egocentric relations to environment objects, analogous references, etc., requiring an understanding of spatial relations and commonsense knowledge about postures.

3 papers · 0 benchmarks · Images, Texts

XLEnt

XLEnt consists of parallel entity mentions in 120 languages aligned with English. These entity pairs were constructed by performing named entity recognition (NER) and typing on English sentences from mined sentence pairs. The extracted English entity labels and types were then projected to the non-English sentences through word alignment. Word alignment was performed by combining three alignment signals into a unified word-alignment model: (1) word co-occurrence alignment with FastAlign, (2) semantic alignment using LASER embeddings, and (3) phonetic alignment via transliteration. This lexical/semantic/phonetic alignment approach yielded more than 160 million aligned entity pairs in 120 languages paired with English. Because each English entity is often aligned to multiple entities in different target languages, one can join on English entities to obtain aligned entity pairs that directly pair two non-English entities (e.g., Arabic-French).

3 papers · 0 benchmarks · Texts
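
The final step in the XLEnt entry above, joining on English entities to pair two non-English languages directly, can be sketched as follows. This is a minimal illustration that assumes the aligned pairs are available as `(english_entity, language_code, target_entity)` tuples; that tuple layout, the function name, and the toy data are assumptions, not the dataset's actual file format or the authors' code.

```python
from collections import defaultdict
from itertools import combinations


def pivot_on_english(pairs):
    """Join English-aligned entity pairs into direct non-English pairs.

    `pairs` is an iterable of (english_entity, lang_code, target_entity)
    tuples (a hypothetical layout). Returns a set of
    (lang_a, entity_a, lang_b, entity_b) tuples with lang_a < lang_b.
    """
    by_english = defaultdict(set)
    for english, lang, target in pairs:
        by_english[english].add((lang, target))

    joined = set()
    for mentions in by_english.values():
        # Every two mentions sharing the same English pivot form a candidate pair.
        for (lang_a, ent_a), (lang_b, ent_b) in combinations(sorted(mentions), 2):
            if lang_a != lang_b:
                joined.add((lang_a, ent_a, lang_b, ent_b))
    return joined


# Toy data for illustration only.
toy_pairs = [
    ("Germany", "fr", "Allemagne"),
    ("Germany", "de", "Deutschland"),
    ("Germany", "es", "Alemania"),
    ("London", "fr", "Londres"),
    ("London", "es", "Londres"),
]
direct_pairs = pivot_on_english(toy_pairs)
```

Note that a pivot join of this kind inherits any alignment errors from both English-side pairs, so the resulting non-English pairs are likely noisier than the original English-aligned ones.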

PatTR (Patent Translation Resource)

PatTR is a sentence-parallel corpus extracted from the MAREC patent collection. The current version contains more than 22 million German-English and 18 million French-English parallel sentences collected from all patent text sections as well as 5 million German-French sentence pairs from patent titles, abstracts and claims.

3 papers · 0 benchmarks · Texts

italki NLI

A large, crowd-sourced dataset for the Native Language Identification (NLI) task. People learning English as a second language write practice Notebooks which can be used to classify the author's native language using word choice, spelling mistakes and other language features.

3 papers · 1 benchmark · Texts

Hate Counter

This dataset is built from Twitter and contains 1,290 pairs of hate tweets and counterspeech replies. After the annotation process, the dataset consists of 558 unique hate tweets from 548 users and 1,290 counterspeech replies from 1,239 users.

3 papers · 0 benchmarks · Texts

Natural Hazards Twitter Dataset

Natural Hazards is a natural disaster dataset with sentiment labels, which contains nearly 50,00 tweets about different natural disasters in the United States (e.g., a tornado in 2011, Hurricane Sandy in 2012, a series of floods in 2013, Hurricane Matthew in 2016, a blizzard in 2016, Hurricane Harvey in 2017, Hurricane Michael in 2018, a series of wildfires in 2018, and Hurricane Dorian in 2019).

3 papers · 0 benchmarks · Texts

UPFD-GOS (User Preference-aware Fake News Detection)

The Gossipcop variant of the UPFD dataset for benchmarking.

3 papers · 2 benchmarks · Graphs, Texts

Healthline

Healthline is a nutrition-related dataset for multi-document summarization of scientific studies.

3 papers · 0 benchmarks · Texts

P3 (Psychophysical Patterns Dataset)

A set of patterns used in psychophysical research to evaluate the ability of saliency algorithms to find targets distinct from distractors in orientation, color and size. Each image is a 7x7 grid and contains a single target. All images are 1024x1024px and have corresponding ground truth masks for the target and distractors.

3 papers · 0 benchmarks · Images, Texts

SaRoCo

SaRoCo is a dataset for detecting satire in Romanian news. It contains 55,608 news articles from multiple real and satirical news sources, of which 27,980 are regular and 27,628 are satirical news reports. The data is provided in CSV format, in three files following the train/validation/test splits.

3 papers · 0 benchmarks · Texts

UIT-ViWikiQA

UIT-ViWikiQA is a dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. It is converted from the UIT-ViQuAD dataset and consists of 23,074 question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles.

3 papers · 0 benchmarks · Texts

NEMO-Corpus (NEMO Hebrew NER and Morphology Corpus)

Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. We publish the NEMO corpus in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO^2)" [1], where we use it in extensive experiments and analyses, showing the importance of morphological boundaries for neural modeling of NER in morphologically rich languages. Code for these models and experiments can be found in the NEMO code repo.

3 papers · 1 benchmark · Texts

TaL Corpus (The Tongue and Lips Corpus)

The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of lips. This corpus contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.

3 papers · 0 benchmarks · Audio, Speech, Texts, Videos

MRS (Multilingual Reply Suggestion)

MRS is a multilingual reply suggestion dataset covering ten languages. It can be used to compare two families of models: (1) retrieval models that select the reply from a fixed set and (2) generation models that produce the reply from scratch. MRS therefore complements existing cross-lingual generalization benchmarks, which focus on classification and sequence labeling tasks.

3 papers · 0 benchmarks · Texts
