3,148 machine learning datasets
This dataset is used for classifying gene-disease relationship types from sentences. It consists of 3 files.
Using Council Data Project infrastructures (https://councildataproject.org), we assemble longitudinal municipal council meeting transcript data. This initial release of the Councils in Action dataset includes over 350 meetings of the city councils of Seattle, Washington, and Portland, Oregon, and the county council of King County, Washington.
The SurvayBank includes 9,321 high-quality survey papers in the domain of computer science.
PLOD: An Abbreviation Detection Dataset
This release provides a significantly sized, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
This release provides a significantly sized, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed tags (PER, LOC, ORG).
PLOD: An Abbreviation Detection Dataset
COVMis-Stance is a stance detection dataset for COVID-19 misinformation. It consists of fake news and claims related to COVID-19: fake news was collected from articles on fact-checking sites, and fake claims from the WHO's official Twitter account. It contains 2,631 tweets annotated for stance towards 111 COVID-19 misinformation items.
TASTEset (Recipe Dataset and Food Entities Recognition) is a dataset for Named Entity Recognition (NER) consisting of 700 recipes with more than 13,000 entities to extract.
Pirá is a crowdsourced question answering (QA) dataset on the ocean and the Brazilian coast, designed for reading comprehension, with a large set of questions and answers in both Portuguese and English.
TuGebic is a corpus of recordings of spontaneous speech samples from Turkish-German bilinguals. Participants were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording, in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportion of words from each language.
The ExaASC dataset is a dataset for target-based stance detection in the Arabic language that covers different types of targets, such as persons, entities, and events. The corpus contains about 9,500 source tweets with replies, with the target specified in the source tweet. Each sample has at least two stance annotations provided by Exa Corporation annotators, and the stance of each reply is annotated toward the target in the corresponding source tweet. The data format is: id, main (source tweet), reply, target, a label for each annotator id, and majority_label.
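A minimal sketch of reading such a file and recomputing the majority label, assuming a CSV export named exaasc.csv with columns id, main, reply, target, label_1, label_2 and majority_label (the file name and exact column names are assumptions, not part of the release):

```python
import csv
from collections import Counter

# Assumed file name and column names; adjust to the actual ExaASC release.
with open("exaasc.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Collect the per-annotator stance labels (at least two per sample).
        labels = [row[c] for c in ("label_1", "label_2") if row.get(c)]
        majority = Counter(labels).most_common(1)[0][0]
        print(row["id"], row["target"], majority, row["majority_label"])
```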
Contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata (authors and publication venues), collected from the MEDLINE database.
A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations were performed by 4 native speakers per language and achieve a high inter-annotator agreement (0.77).
This dataset of medical misinformation was collected and is published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full texts of the articles, their original source URLs, and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the article) and the article's stance towards the claim.
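A minimal sketch of joining the article and claim tables through the mappings, assuming CSV files articles.csv, claims.csv and mappings.csv with key columns article_id and claim_id and a claim_presence field (file and column names are assumptions about the released format, not documented here):

```python
import pandas as pd

# Assumed file and column names; adjust to the actual KInIT release.
articles = pd.read_csv("articles.csv")    # full texts, source URLs, metadata
claims = pd.read_csv("claims.csv")        # verified medical claims with veracity ratings
mappings = pd.read_csv("mappings.csv")    # claim-article mappings (manual + automatic)

# Attach each mapped claim to its article via the mapping table.
linked = (mappings
          .merge(articles, on="article_id")
          .merge(claims, on="claim_id"))
print(linked["claim_presence"].value_counts())
```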
An open-source online generative dictionary that takes a word and context containing the word as input and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries. Moreover, it has a user-friendly front-end design that can help users understand the query words quickly and easily.
CLUES is a benchmark for Classifier Learning Using natural language ExplanationS, consisting of a range of classification tasks over structured data along with natural language supervision in the form of explanations. CLUES consists of 36 real-world (CLUES-Real) and 144 synthetic (CLUES-Synthetic) classification tasks. It contains crowdsourced explanations describing real-world tasks from multiple teachers and programmatically generated explanations for the synthetic tasks.
Diacritized texts in Modern Hebrew, collected from eleven different sources. Diacritized using Ktiv Male conventions.
A dataset automatically generated using question generation neural models and alt-text video captions from the WebVid dataset, with 3M video-question-answer triplets.
A sentiment analysis task about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service"). The non-aggregated results (55,000 rows) are available for download; a brief exploration sketch follows below.
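A minimal sketch of one way to explore the non-aggregated results, assuming a CSV named Tweets.csv with columns airline, airline_sentiment and negativereason (the file and column names are assumptions based on common exports of this dataset):

```python
import pandas as pd

# Assumed file and column names; adjust to the downloaded export.
df = pd.read_csv("Tweets.csv")

# Share of positive / negative / neutral tweets per airline.
print(df.groupby("airline")["airline_sentiment"].value_counts(normalize=True))

# Most frequent reasons given for negative tweets, e.g. "late flight", "rude service".
negative = df[df["airline_sentiment"] == "negative"]
print(negative["negativereason"].value_counts().head(10))
```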