Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3d meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • Midi (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • Cad (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

WEB-FORUM-52 (WEB-FORUM-52 gold standard)

The WEB-FORUM-52 gold standard comprises (i) 13 web forums from the health domain, (ii) 15 forums obtained from a Wikipedia list of popular forums (https://en.wikipedia.org/wiki/List_of_Internet_forums), (iii) 13 forums mentioned on a list of popular German web forums (https://www.beliebte-foren.de), (iv) nine forums obtained from WPressBlog (https://www.wpressblog.com/free-forum-posting-sites-list/) and (v) two additional forums. For most forums, two web pages (from different threads) were stored together with gold-standard annotations, manually created by domain experts, that describe the post text, post date, post user, and direct URL to each post.

1 paper · 0 benchmarks · Texts
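As a rough illustration of how such gold-standard annotations could be used to evaluate a forum-post extractor, here is a minimal Python sketch. The record fields mirror the annotated attributes listed above, but the class name, field names, and scoring function are hypothetical and not part of the dataset's actual release format.

```python
# Minimal sketch; field names are assumptions, not the dataset's actual schema.
from dataclasses import dataclass, asdict

@dataclass
class GoldPost:
    """One manually annotated forum post from a WEB-FORUM-52 web page."""
    text: str   # post text
    date: str   # post date
    user: str   # post user
    url: str    # direct URL to the post

def field_accuracy(gold: list[GoldPost], predicted: list[GoldPost]) -> dict[str, float]:
    """Per-field exact-match accuracy of an extractor against the gold standard."""
    fields = ["text", "date", "user", "url"]
    totals = {f: 0 for f in fields}
    for g, p in zip(gold, predicted):
        g_d, p_d = asdict(g), asdict(p)
        for f in fields:
            totals[f] += int(g_d[f] == p_d[f])
    return {f: totals[f] / len(gold) for f in fields}
```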

POLIT-FALSE-n-LEGIT NEWS DB 2016-2017

The LiT.RL POLIT-FALSE-n-LEGIT NEWS DB 2016-2017 contains a total of 274 news articles about U.S. Politics, content-matched in pairs of legitimate and falsified news. The database is free and released under an open license for educational and research purposes.

1 paper · 0 benchmarks · Texts

SVDC Fake News Dataset

A labeled dataset that presents fake news surrounding the conflict in Syria. The dataset consists of a set of articles/news labeled 0 (fake) or 1 (credible). The credibility of articles is computed with respect to ground-truth information obtained from the Syrian Violations Documentation Center (VDC). In particular, for each article, we crowdsource the information extraction (e.g., date, location, number of casualties) using the crowdsourcing platform Figure Eight (formerly CrowdFlower). Then, we match those articles against the VDC database to deduce whether an article is fake or not. The dataset can be used to train machine learning models to detect fake news.

1 paper · 0 benchmarks · Texts
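The labeling procedure described above (matching crowdsourced facts against VDC records) can be pictured with a small sketch. The matching rule and field names below are assumptions made for illustration; the dataset authors' exact matching criteria are not reproduced here.

```python
# Hypothetical sketch of the labeling idea: an article is marked credible (1)
# if its crowdsourced facts match a ground-truth VDC-style record, else fake (0).

def label_article(extracted: dict, vdc_records: list[dict]) -> int:
    """Return 1 (credible) if some ground-truth record matches the extracted facts, else 0 (fake)."""
    keys = ("date", "location", "casualties")
    for record in vdc_records:
        if all(extracted.get(k) == record.get(k) for k in keys):
            return 1
    return 0

# Example usage with made-up values:
vdc = [{"date": "2016-05-01", "location": "Aleppo", "casualties": 12}]
article_facts = {"date": "2016-05-01", "location": "Aleppo", "casualties": 12}
assert label_article(article_facts, vdc) == 1
```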

The Contextual TV Dataset (CTV)

Using the Experience-Sampling Method (ESM), participants are asked to report TV consumption multiple times each day for a five-week period. Through self-reported data, the authors reduce uncertainty about exposure to content and enable the collection of non-trivial information, such as how much attention is paid to the TV. The data is structured to accommodate quantitative analyses, e.g. in the CARS community, and is publicly available under the name Contextual TV (CTV) dataset.

1 paper · 0 benchmarks · Texts

Metric-Type of Numerical Tables

Metric-Type of Numerical Tables is a dataset extracted from scientific papers (the ACL Anthology website) consisting of table headers, captions, and metric types.

1 paper · 0 benchmarks · Texts

Ubuntu Chat Corpus

The Ubuntu Chat Corpus (UCC) is composed of archived chat logs from Ubuntu's Internet Relay Chat technical support channels. Ubuntu uses IRC as one of many modes of technical support -- it offers real-time problem solving. The authors have taken some of the archived messages (which are in the public domain), reorganized the file structure, removed some unnecessary system messages, and compressed them to make it easier to obtain.

1 paper · 0 benchmarks · Texts

Liu et al. Corpus

The Liu et al. Corpus is a pretraining dataset for large language models. It consists of 160 GB of news, books, stories, and web text.

1 paper · 0 benchmarks · Texts

B-T4SA

1 paper · 1 benchmark · Images, Texts

AbstRCT - Neoplasm

The AbstRCT dataset consists of randomized controlled trials retrieved from the MEDLINE database via PubMed search. The trials are annotated with argument components and argumentative relations.

1 paper · 3 benchmarks · Texts

An Amharic News Text classification Dataset

In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data has made it harder to perform these tasks in low-resource languages like Amharic. Collecting, labeling, annotating, and making this kind of data valuable will encourage junior researchers, schools, and machine learning practitioners to apply existing classification models to their language. In this short paper, we introduce an Amharic text classification dataset that consists of more than 50k news articles categorized into 6 classes. The dataset is made available together with simple baseline results to encourage further studies and better-performing experiments.

1 paper · 2 benchmarks · Texts
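A minimal baseline sketch for the 6-class classification task, using a TF-IDF plus logistic-regression pipeline. The file name and column names ("article", "category") are assumptions and should be adapted to the actual release; this is not the baseline reported by the dataset authors.

```python
# Hedged baseline sketch for multi-class Amharic news classification.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("amharic_news.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["article"], df["category"], test_size=0.2, random_state=0, stratify=df["category"]
)

# TF-IDF features followed by a linear classifier as a simple starting point.
model = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```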

DODa (Darija Open Dataset)

Darija Open Dataset (DODa) is an open-source project for the Moroccan dialect. With more than 10,000 entries, DODa is arguably the largest open-source collaborative project for Darija-English translation built for Natural Language Processing purposes. In fact, besides semantic categorization, DODa also adopts a syntactic one, presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugation of hundreds of verbs in different tenses, and includes many other subsets to help researchers better understand and study the Moroccan dialect.

1 paper · 0 benchmarks · Texts
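A rough sketch of how DODa-style entries (multiple spellings, English gloss, syntactic category) might be represented and queried in code. The field names and example values are invented for illustration and do not reflect the project's actual file layout.

```python
# Hypothetical representation of one lexicon entry; values are made up.
entry = {
    "darija_spellings": ["ktab", "kitab"],  # the same word under different spellings
    "english": "book",
    "category": "noun",
}

def translate(word: str, lexicon: list[dict]) -> str | None:
    """Return the English gloss for a Darija word, matching any listed spelling."""
    for e in lexicon:
        if word in e["darija_spellings"]:
            return e["english"]
    return None

print(translate("kitab", [entry]))  # -> "book"
```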

LeT-Mi (Levantine Twitter dataset for Misogynistic language)

Levantine Twitter dataset for Misogynistic language (LeT-Mi) is an Arabic Levantine Twitter dataset for misogynistic language to be the first benchmark dataset for Arabic misogyny.

1 paper · 0 benchmarks · Texts

Autoencoder Paraphrase Dataset (AEPD)

This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.

1 paper · 0 benchmarks · Texts

Machine Paraphrase Corpus (MPC)

This dataset is used to train and evaluate models for the detection of machine-paraphrased text.

1 paper · 0 benchmarks · Texts

Twitter Abusive Context

This dataset for abusive content detection on Twitter consists of two sets of annotations for the same set of tweets: one where the human annotators had access to the tweet's context and one where they did not.

1 paper · 0 benchmarks · Texts
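Since the dataset pairs two annotation conditions for the same tweets, one natural analysis is agreement between the with-context and without-context labels. A minimal sketch using Cohen's kappa follows; the label values are invented for illustration and this is not the analysis reported by the dataset authors.

```python
# Compare labels assigned with vs. without access to the tweet's context.
from sklearn.metrics import cohen_kappa_score

with_context    = ["abusive", "normal", "abusive", "normal", "abusive"]
without_context = ["abusive", "abusive", "abusive", "normal", "normal"]

# Cohen's kappa measures chance-corrected agreement between the two conditions.
print(cohen_kappa_score(with_context, without_context))
```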

RUSS Dataset

RUSS (Rapid Universal Support Service) is a dataset that consists of a collection of 741 real-world step-by-step natural language instructions (raw and annotated) from the open web, and for each: its corresponding webpage DOM, ground-truth ThingTalk, and ground-truth actions.

1 paper · 0 benchmarks · Texts

RepLab 2013

The RepLab 2013 dataset uses Twitter data in English and Spanish (more than 142,000 tweets). The balance between both languages depends on the availability of data for each of the entities included in the dataset. The corpus consists of a collection of tweets referring to a selected set of 61 entities from four domains: automotive, banking, universities and music/artists. The domain selection was done to offer a variety of scenarios for reputation studies.

1 paper · 0 benchmarks · Texts

A2Dre (Subset of A2D Sentences which are not trivial)

We obtain A2Dre by selecting only instances that were labeled as non-trivial: 433 REs from 190 videos. We do not use the trivial cases, since the analysis of such examples is not informative: their referents can be described by the category alone. Each annotator was presented with a RE, a video in which the target object was marked by a bounding box, and a set of questions paraphrasing our categories. A2Dre was annotated by 3 authors of the paper. Our final set of category annotations used for analysis was derived by majority voting: for each non-trivial RE, we kept all category labels assigned to it by at least two annotators.

1 paper · 0 benchmarks · Texts
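The majority-voting rule described above (keep every category label assigned by at least two of the three annotators) can be written down in a few lines. The function below is a hypothetical sketch, not the authors' code; the category names are taken from the description.

```python
# Majority voting over per-annotator category labels for one referring expression.
from collections import Counter

def majority_labels(annotations: list[set[str]], min_votes: int = 2) -> set[str]:
    """Return all category labels assigned by at least `min_votes` annotators."""
    counts = Counter(label for ann in annotations for label in ann)
    return {label for label, n in counts.items() if n >= min_votes}

# Three annotators label the same RE; only "appearance" gets >= 2 votes.
votes = [{"appearance", "location"}, {"appearance"}, {"appearance", "motion"}]
print(majority_labels(votes))  # -> {"appearance"}
```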

A2Dre+ (Extension of A2D sentences where trivial cases were filtered)

A2Dre is a subset of the A2D test set including 433 non-trivial REs. Due to its highly unbalanced distribution across the 7 semantic categories, we select the 4 major categories: appearance, location, motion, and static. These four categories have in common that, in most cases, a given referent admits both a RE that expresses a certain category and one that does not. We use these categories to augment A2Dre with additional REs, which vary according to the presence or absence of each of them. Specifically, based on our categorization of the original REs, for each RE re and category C, we produce an additional RE re' by modifying re slightly such that it does (or does not) express C. For example, for the RE "girl in yellow dress standing near the woman", which could be categorized as appearance, location, no motion, and static, we produce new REs for each category.

1 paper · 0 benchmarks · Texts

JUSThink Dialogue and Actions Corpus

The JUSThink Dialogue and Actions Corpus contains dialogue transcripts, event logs, and test responses of children aged 9 through 12 as they participate in a robot-mediated human-human collaborative learning activity named JUSThink, in which teams of two children solve a problem on graphs together.

1 paper · 0 benchmarks · Texts
Page 108 of 158