Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)


IBM Debater Mention Detection Benchmark

This dataset contains general and named entity annotations on both clean written text and noisy speech data. It includes 1,000 sentences from Wikipedia and 1,000 sentences of speech data that appear in two forms: (1) transcribed manually, and (2) the output of an ASR engine. Each of the datasets includes a total of around 6,500 mentions linked to their DBpedia pages.

1 paper · 0 benchmarks · Texts

HumanMT

HumanMT is a collection of human ratings and corrections of machine translations. It consists of two parts: the first contains five-point and pairwise sentence-level ratings; the second contains error markings and corrections.

1 paper · 0 benchmarks · Texts

MetaCLIR

This data adds textual meta-information to two existing corpora for cross-language information retrieval: BoostCLIR and the Large-Scale CLIR Dataset (wiki-clir).

1 paper · 0 benchmarks · Texts

COPA-HR

The COPA-HR dataset (Choice of Plausible Alternatives in Croatian) is a translation of the English COPA dataset following the XCOPA translation methodology. The dataset consists of 1,000 premises (e.g. "My body cast a shadow over the grass"), each paired with a question ("What is the cause?") and two choices ("The sun was rising"; "The grass was cut"), with a label encoding which of the two choices is more plausible given the premise ("The sun was rising").

1 paper · 0 benchmarks · Texts
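As a concrete illustration, a COPA-style item pairs a premise with a question type and two choices; the field names below are an assumption for illustration, not the dataset's actual schema:

```python
# Hypothetical representation of a COPA-HR item (field names are assumptions);
# the premise and choices are the example quoted in the dataset description.
item = {
    "premise": "My body cast a shadow over the grass",
    "question": "cause",  # COPA asks for either the cause or the effect
    "choice1": "The sun was rising",
    "choice2": "The grass was cut",
    "label": 0,  # index of the more plausible choice (0 -> choice1)
}

# Resolve the label to the choice text it points at.
correct = item["choice1"] if item["label"] == 0 else item["choice2"]
print(correct)
```

A model for this task receives the premise, question type, and both choices, and must predict the label.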

Credibility Factors 2020

This dataset focuses on 50 articles about climate science, each annotated in full by 49 students, 26 Upwork workers, 3 science experts and 3 journalism experts.

1 paper · 0 benchmarks · Texts

SumeCzech-NER

SumeCzech-NER contains named entity annotations of SumeCzech 1.0, a Czech news-based summarization dataset.

1 paper · 0 benchmarks · Texts

ECTF (Early COVID-19 Twitter Fake news)

ECTF is a dataset for Twitter fake news detection in the COVID-19 domain.

1 paper · 0 benchmarks · Texts

AbuseAnalyzer Dataset

The dataset contains 7,601 Gab posts classified along three different aspects: presence of abuse, abuse severity and abuse target.

1 paper · 0 benchmarks · Texts

FTR-18

FTR-18 is a multilingual rumour dataset on football transfer news. Transfer rumours are continuously published by sports media; they can harm the image of a player or a club, or increase a player's market value. The dataset includes transfer articles written in English, Spanish and Portuguese, along with Twitter reactions related to the transfer rumours. FTR-18 is suited for rumour classification tasks and enables research on the linguistic patterns used in sports journalism.

1 paper · 0 benchmarks · Texts

PersianQA (Persian Question Answering Dataset)

The Persian Question Answering (PersianQA) dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry is either an impossible-to-answer question or a question with one or more answers spanning the passage (the context) from which the question was proposed. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be used to create a system that "knows that it doesn't know the answer".

1 paper · 0 benchmarks · Texts
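Since the description says the dataset mirrors SQuAD2.0, a reader of such data would likely follow the SQuAD2.0 convention, where unanswerable questions carry an `is_impossible` flag and an empty answer list. A minimal sketch, assuming SQuAD2.0-style JSON fields (PersianQA's exact schema may differ):

```python
import json

# Hypothetical entry in SQuAD2.0-style JSON; the field names are an
# assumption based on the format the description says PersianQA mirrors.
entry = json.loads("""
{
  "question": "...",
  "is_impossible": true,
  "answers": []
}
""")

def has_answer(e):
    # SQuAD2.0 convention: unanswerable questions are marked with
    # is_impossible=true and have no answer spans.
    return not e.get("is_impossible", False) and len(e.get("answers", [])) > 0

print(has_answer(entry))  # an unanswerable entry yields False
```

A system trained on such data must learn to abstain on entries where `has_answer` is false rather than guess a span.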

HoaxItaly

HoaxItaly consists of over 1 million tweets shared during 2019 and containing links to thousands of news articles published on two classes of Italian outlets: (1) disinformation websites, i.e. outlets which have been repeatedly flagged by journalists and fact-checkers for producing low-credibility content such as false news, hoaxes, click-bait, misleading and hyper-partisan stories; (2) fact-checking websites which notably debunk and verify online news and claims. The dataset includes title and body for approximately 37k news articles.

1 paper · 0 benchmarks · Graphs, Texts

A Dataset of State-Censored Tweets

This is a dataset of 583,437 tweets by 155,715 users that were censored between 2012 and July 2020. It also contains 4,301 accounts that were censored in their entirety, plus a related set of 22,083,759 supplemental tweets comprising all tweets by users with at least one censored tweet, as well as retweets of censored users by others.

1 paper · 0 benchmarks · Texts

CoronaVis

CoronaVis is a dataset of tweets related to coronavirus.

1 paper · 0 benchmarks · Texts

Apiza Corpus

The Apiza Corpus is a Wizard-of-Oz (WoZ) style set of dialogues between 30 programmers and a simulated virtual assistant. This corpus can be used to study or train a virtual assistant for software engineering.

1 paper · 0 benchmarks · Texts

BLM-17m

BLM-17m is a labeled dataset for topic detection that contains 17 million tweets. The tweets were collected from 25 May 2020 to 21 August 2020, covering 89 days from the start of the George Floyd incident. The dataset was labeled by monitoring the most trending news topics from global and local newspapers.

1 paper · 0 benchmarks · Texts

Peer to Peer Hate

Peer to Peer Hate is a comprehensive hate speech dataset capturing various types of hate. It has been built from 27,330 hate speech tweets.

1 paper · 0 benchmarks · Texts

Xamarin Q&A

Xamarin Q&A consists of two datasets of questions and answers for studying the development of cross-platform mobile applications using the Xamarin framework. The two datasets were created by mining two Q&A sites: Xamarin Forum and Stack Overflow. The datasets have 85,908 questions mined from the Xamarin Forum and 44,434 from Stack Overflow.

1 paper · 0 benchmarks · Texts

Reddit Norm Violations

This is a dataset of over 40K Reddit comments removed by moderators according to the specific type of macro norm being violated.

1 paper · 0 benchmarks · Texts

IAPR TC-12 (IAPR TC-12 Benchmark)

The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world, comprising an assorted cross-section of contemporary life: pictures of different sports and actions, photographs of people, animals, cities, landscapes, and more. Each image is associated with a text caption in up to three different languages (English, German and Spanish).

1 paper · 0 benchmarks · Images, Texts

Wiki-Reliability

Wiki-Reliability is the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues. Templates are tags used by expert Wikipedia editors to indicate content issues, such as the presence of "non-neutral point of view" or "contradictory articles", and serve as a strong signal for detecting reliability issues in a revision. We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative with respect to each template. Each positive/negative example in the dataset comes with the full article text and 20 features from the revision's metadata. We provide an overview of the possible downstream tasks enabled by such data, and show that Wiki-Reliability can be used to train large-scale models for content reliability prediction.

1 paper · 0 benchmarks · Texts
Page 109 of 158