Datasets

3,148 machine learning datasets

PubMed PICO Element Detection Dataset

PICO is a framework to formulate a well-defined focused clinical question. This framework identifies the sentences in a given medical text that belong to the four components: Participants/Problem (P), Intervention (I), Comparison (C) and Outcome (O). The PubMed PICO Element Detection dataset is a dataset for evaluating models that automatically detect PICO elements.

2 papers | 0 benchmarks | Texts

JNC (Japanese News Corpus)

The JNC data provides common supervision data for headline generation.

2 papers | 0 benchmarks | Texts

FarsBase-KBP

FarsBase-KBP contains 22015 sentences, in which the entities and relation types are linked to the FarsBase ontology. This gold dataset can be reused for benchmarking KBP systems in the Persian language.

2 papers | 0 benchmarks | Texts

Almawave-SLU

Almawave-SLU is the first Italian dataset for Spoken Language Understanding (SLU). It is derived through a semi-automatic procedure and is used as a benchmark of various open source and commercial systems.

2 papers | 0 benchmarks | Texts

FSVQA (Full-Sentence Visual Question Answering)

The Full-Sentence Visual Question Answering (FSVQA) dataset consists of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to the original VQA dataset and the captions in the MS COCO dataset.

2 papers | 0 benchmarks | Images, Texts

VQA 360°

VQA 360° is a dataset for visual question answering on 360° images containing around 17,000 real-world image-question-answer triplets for a variety of question types.

2 papers | 0 benchmarks | Images, Texts

ISOT Fake News Dataset

The ISOT Fake News dataset is a compilation of several thousand fake news and truthful articles, obtained from different legitimate news sites and sites flagged as unreliable by Politifact.com.

2 papers | 0 benchmarks | Texts

Clinical Admission Notes from MIMIC-III

This dataset is created from MIMIC-III (Medical Information Mart for Intensive Care III) and contains simulated patient admission notes. The clinical notes contain information about a patient at the time of admission to the ICU and are labelled for four outcome prediction tasks: diagnoses at discharge, procedures performed, in-hospital mortality, and length of stay.

2 papers | 6 benchmarks | Texts

IG-3.5B-17k

IG-3.5B-17k is an internal Facebook AI Research dataset for training image classification models. It consists of hashtags for up to 3.5 billion public Instagram images.

2 papers | 0 benchmarks | Images, Texts

DRI Corpus (Dr. Inventor Multi-layer Scientific Corpus)

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) includes 40 Computer Graphics papers, selected by domain experts. Each paper in the corpus has been annotated by three annotators with several layers of annotation, each one characterizing a core aspect of scientific publications.

2 papers | 3 benchmarks | Texts

CC-News (CommonCrawl News dataset)

CommonCrawl News is a dataset containing news articles from news sites all over the world. The dataset is available in the form of Web ARChive (WARC) files that are released on a daily basis.

2 papers | 0 benchmarks | Texts

StyleKQC

StyleKQC is a style-variant paraphrase corpus for Korean questions and commands. It was built with a corpus construction scheme that simultaneously considers the core content and style of directives, namely intent and formality, for the Korean language. Starting from manually generated natural language queries on six daily topics, the corpus was expanded to formal and informal sentences by human rewriting and transferring.

2 papers | 0 benchmarks | Texts

THYME-2016

2 papers | 1 benchmark | Medical, Texts

BiasCorp

BiasCorp is a dataset for racism detection containing 139,090 comments and news segments from three specific sources: Fox News, BreitbartNews and YouTube.

2 papers | 0 benchmarks | Texts

EtymDB 2.0

A multilingual etymological database extracted from Wiktionary, described in "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0".

2 papers | 0 benchmarks | Texts

NorDial

NorDial is the first step toward creating a corpus of dialectal variation in written Norwegian. It consists of a small corpus of tweets manually annotated as Bokmål, Nynorsk, any dialect, or a mix.

2 papers | 0 benchmarks | Texts

Twitter Stance Election 2020

The data set contains 2500 manually stance-labeled tweets, 1250 for each candidate (Joe Biden and Donald Trump). These tweets were sampled from an unlabeled set of English tweets related to the 2020 US Presidential election, which the authors collected through the Twitter Streaming API using election-related hashtags and keywords. Between January 2020 and September 2020, over 5 million tweets were collected, not including quotes and retweets.

2 papers | 1 benchmark | Texts

Subjective Discourse

This is a discourse dataset with multiple and subjective interpretations of English conversation in the form of perceived conversation acts and intents. The dataset consists of witness testimonials in U.S. congressional hearings.

2 papers | 0 benchmarks | Texts

Eedi Dataset

The Eedi dataset contains two school years (September 2018 to May 2020) of students’ answers to mathematics questions from Eedi, a leading educational platform that millions of students around the globe interact with daily. Eedi offers diagnostic questions to students from primary to high school (roughly between 7 and 18 years old). Each diagnostic question is a multiple-choice question with 4 possible answer choices, exactly one of which is correct. Currently, the platform mainly focuses on mathematics questions.

2 papers | 0 benchmarks | Texts

RTC (Reddit Time Corpus)

RTC is a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. The Reddit Time Corpus (RTC) covers three years between March 2017 and February 2020 and is split into 36 evenly-sized monthly subsets based on comment timestamps. RTC is sampled from the Pushshift Reddit dataset.

2 papers | 0 benchmarks | Texts
