3,148 machine learning datasets
This dataset is used for classifying gene-disease relationship types from sentences. It consists of 3 files.
Using Council Data Project infrastructures (https://councildataproject.org), we assemble longitudinal municipal council meeting transcript data. This initial release of the Councils in Action dataset includes over 350 meetings of the city councils of Seattle, Washington, and Portland, Oregon, and the county council of King County, Washington.
The SurvayBank includes 9,321 high-quality survey papers in the domain of computer science.
PLOD: An Abbreviation Detection Dataset
This release provides a significantly sized, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
This release provides a significantly sized, standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 3 collapsed tags (PER, LOC, ORG).
PLOD: An Abbreviation Detection Dataset
COVMis-Stance is a stance detection dataset for COVID-19 misinformation. It consists of fake news and claims related to COVID-19: fake news was collected from articles on fact-checking sites, and fake claims from the WHO's official Twitter account. It contains 2,631 tweets annotated for stance towards 111 COVID-19 misinformation items.
TASTEset (Recipe Dataset and Food Entities Recognition) is a dataset for Named Entity Recognition (NER) consisting of 700 recipes with more than 13,000 entities to extract.
Pirá is a crowdsourced question answering (QA) dataset on the ocean and the Brazilian coast, designed for reading comprehension, with a large set of questions and answers in both Portuguese and English.
TuGebic is a corpus of recordings of spontaneous speech samples from Turkish-German bilinguals. Participants were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording, in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportion of words from each language.
The ExaASC dataset is a dataset for target-based stance detection in the Arabic language that covers different types of targets, such as persons, entities, and events. The corpus contains about 9,500 source tweets with replies, with the target specified in the source tweet. Each sample has at least two stance annotations provided by Exa Corporation annotators, and the stance of each reply is annotated toward the target in the corresponding source tweet. The data format is: id, main (source tweet), reply, target, a label for each annotator id, and majority_label.
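A minimal sketch of reading such a file and recomputing the majority label, assuming a CSV export named exaasc.csv with columns id, main, reply, target, label_1, label_2 and majority_label (the file name and exact column names are assumptions, not part of the release):

```python
import csv
from collections import Counter

# Assumed file name and column names; adjust to the actual ExaASC release.
with open("exaasc.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Collect the per-annotator stance labels (at least two per sample).
        labels = [row[c] for c in ("label_1", "label_2") if row.get(c)]
        majority = Counter(labels).most_common(1)[0][0]
        print(row["id"], row["target"], majority, row["majority_label"])
```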
Contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata (authors and publication venues), collected from the MEDLINE database.
A corpus of 9k German and French user comments collected from migration-related news articles. It goes beyond the hate-neutral dichotomy and is instead annotated with 23 features, which in combination become descriptors of various types of speech, ranging from critical comments to implicit and explicit expressions of hate. The annotations were performed by 4 native speakers per language and achieve a high inter-annotator agreement (0.77).
This dataset of medical misinformation was collected and is published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full texts of the articles, their original source URLs, and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the article) and the article's stance towards the claim.
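A minimal sketch of joining the article and claim tables through the mappings, assuming CSV files articles.csv, claims.csv and mappings.csv with key columns article_id and claim_id and a claim_presence field (file and column names are assumptions about the released format, not documented here):

```python
import pandas as pd

# Assumed file and column names; adjust to the actual KInIT release.
articles = pd.read_csv("articles.csv")    # full texts, source URLs, metadata
claims = pd.read_csv("claims.csv")        # verified medical claims with veracity ratings
mappings = pd.read_csv("mappings.csv")    # claim-article mappings (manual + automatic)

# Attach each mapped claim to its article via the mapping table.
linked = (mappings
          .merge(articles, on="article_id")
          .merge(claims, on="claim_id"))
print(linked["claim_presence"].value_counts())
```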
An open-source online generative dictionary that takes a word and context containing the word as input and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries. Moreover, it has a user-friendly front-end design that can help users understand the query words quickly and easily.
CLUES is a benchmark for Classifier Learning Using natural language ExplanationS, consisting of a range of classification tasks over structured data along with natural language supervision in the form of explanations. CLUES consists of 36 real-world (CLUES-Real) and 144 synthetic (CLUES-Synthetic) classification tasks. It contains crowdsourced explanations describing real-world tasks from multiple teachers and programmatically generated explanations for the synthetic tasks.
Diacritized texts in Modern Hebrew, collected from eleven different sources. Diacritized using Ktiv Male conventions.
A dataset automatically generated using question generation neural models and alt-text video captions from the WebVid dataset, with 3M video-question-answer triplets.
A sentiment analysis task about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service"). The non-aggregated results (55,000 rows) are available for download; a brief exploration sketch follows below.
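A minimal sketch of one way to explore the non-aggregated results, assuming a CSV named Tweets.csv with columns airline, airline_sentiment and negativereason (the file and column names are assumptions based on common exports of this dataset):

```python
import pandas as pd

# Assumed file and column names; adjust to the downloaded export.
df = pd.read_csv("Tweets.csv")

# Share of positive / negative / neutral tweets per airline.
print(df.groupby("airline")["airline_sentiment"].value_counts(normalize=True))

# Most frequent reasons given for negative tweets, e.g. "late flight", "rude service".
negative = df[df["airline_sentiment"] == "negative"]
print(negative["negativereason"].value_counts().head(10))
```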