It is a freely available resource for research on handling negation and uncertainty in biomedical texts. The corpus consists of three parts: medical free texts, biological full papers, and biological scientific abstracts. The dataset contains token-level annotations for negative and speculative keywords and sentence-level annotations for their linguistic scope. The annotation process was carried out by two independent linguist annotators and a chief annotator, who was also responsible for setting up the annotation guidelines and who resolved cases where the annotators disagreed.
Contains 1,507 domain-expert-annotated consumer health questions and their corresponding summaries. The dataset is derived from a community question-answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media.
DR.BENCH is a dataset for developing and evaluating clinical NLP (cNLP) models with clinical diagnostic reasoning ability. The suite includes six tasks drawn from ten publicly available datasets, addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.
CommitBART is a benchmark for researching commit-related tasks such as denoising, cross-modal generation, and contrastive learning. The dataset contains over 7 million commits across 7 programming languages.
The CICEROv2 dataset can be found in the data directory. Each line of the files is a JSON object representing a single instance. The JSON objects have the following key-value pairs:
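A one-instance-per-line file like this can be read with the standard `json` module. The sketch below uses placeholder records and a temporary file, since the dataset's actual field names are not listed here; only the line-by-line parsing pattern is the point.

```python
import json
import tempfile

# Placeholder JSON Lines content; the real CICEROv2 fields are not shown here.
sample = '{"id": 0}\n{"id": 1}\n'

with tempfile.NamedTemporaryFile("w+", suffix=".jsonl", delete=False) as f:
    f.write(sample)
    path = f.name

instances = []
with open(path) as f:
    for line in f:
        instances.append(json.loads(line))  # each line is one JSON object

print(len(instances))  # number of instances read
```

Parsing line by line (rather than `json.load` on the whole file) is what the "one JSON object per line" layout requires.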
Question Answering (QA) is a widely-used framework for developing and evaluating an intelligent machine. In this light, QA on Electronic Health Records (EHR), namely EHR QA, can work as a crucial milestone toward developing an intelligent agent in healthcare. EHR data are typically stored in a relational database, which can also be converted to a directed acyclic graph, allowing two approaches for EHR QA: Table-based QA and Knowledge Graph-based QA.
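As a rough illustration of the two representations, a relational row can be flattened into (subject, relation, object) triples for the graph view. The table and column names below are hypothetical, not from any specific EHR schema.

```python
# Hypothetical EHR row (invented schema): one record from a "prescriptions" table.
row = {"patient_id": "p1", "drug": "aspirin", "dose_mg": 81}

def row_to_triples(table, key_col, row):
    """Flatten one relational row into (subject, relation, object) triples,
    using the table name plus primary key as the subject node."""
    subject = f"{table}/{row[key_col]}"
    return [(subject, col, val) for col, val in row.items() if col != key_col]

triples = row_to_triples("prescriptions", "patient_id", row)
print(triples)
```

A Table-based QA system would query the row directly (e.g. via SQL), while a Knowledge-Graph-based system would traverse triples like these.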
The TREC News Track features modern search tasks in the news domain. In partnership with The Washington Post, we are developing test collections that support the search needs of news readers and news writers in the current news environment. It's our hope that the track will foster research that establishes a new sense for what "relevance" means for news search.
Situated Dialogue Navigation (SDN) is a navigation benchmark of 183 trials with a total of 8415 utterances, around 18.7 hours of control streams, and 2.9 hours of trimmed audio. SDN is developed to evaluate the agent's ability to predict dialogue moves from humans as well as generate its own dialogue moves and physical navigation actions.
Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.
Conversational Stance Detection (CSD) is a dataset with annotations of stances and the structures of conversation threads. It consists of 500 conversation threads (including 500 posts and 5376 comments) from six major social media platforms in Hong Kong.
Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.
A vast amount of information in the biomedical domain is available as natural language free text. An increasing number of documents in the field are written in languages other than English. Therefore, it is essential to develop resources, methods and tools that address Natural Language Processing in the variety of languages used by the biomedical community. In this paper, we report on the development of an extensive corpus of biomedical documents in French annotated at the entity and concept level. Three text genres are covered, comprising a total of 103,056 words. Ten entity categories corresponding to UMLS Semantic Groups were annotated, using automatic pre-annotations validated by trained human annotators. The pre-annotation method was found helpful for entities and achieved above 0.83 precision for all text genres. Overall, a total of 26,409 entity annotations were mapped to 5,797 unique UMLS concepts.
Secim2023 is a comprehensive dataset for social media researchers to study the upcoming election, develop tools to prevent online manipulation, and gather novel information to inform the public.
Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. The process of generating attacks is automated, so that TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed.
The Geoclidean-Constraints dataset consists of 20 concepts and 40 tasks, created from permutations of line and circle construction rules with various constraints describing the relationships between objects. This dataset focuses on explicit constraints between geometric objects. We denote the objects as follows: lines as L, circles as C, and triangles (constructed from three lines) as T.
A Dataset for Politeness Classification in Nine Typologically Diverse Languages (TyDiP) is a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples.
Corpus of Egyptian Arabic-English Code-switching (ArzEn) is a spontaneous conversational speech corpus, obtained through informal interviews held at the German University in Cairo. The participants discussed broad topics, including education, hobbies, work, and life experiences. The corpus currently contains 12 hours of speech, having 6,216 utterances. The recordings were transcribed and translated into monolingual Egyptian Arabic and monolingual English.
NarraSum is a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries.
NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.
RoFT is a dataset of 21,000 human annotations of generated text. The task is "boundary detection": given a passage that starts off as human-written, determine when the text transitions to being machine-generated. The dataset also includes error annotations using the taxonomy introduced in the paper. The data can be used to train automatic detection systems, train automatic error correction, analyze the visibility of model errors, and compare performance across models. Data was collected using http://roft.io.
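One simple way to evaluate boundary detection is the distance, in sentences, between the annotated transition point and the true one. This scoring is illustrative only and is not necessarily the metric used with RoFT.

```python
def boundary_error(true_idx, pred_idx):
    """Absolute distance, in sentences, between the true transition point
    (first machine-generated sentence) and the annotator's guess."""
    return abs(true_idx - pred_idx)

# Illustrative: passages where machine text begins at the given sentence index.
guesses = [(6, 6), (6, 4), (3, 7)]  # (true boundary, predicted boundary)
errors = [boundary_error(t, p) for t, p in guesses]
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

An exact guess scores 0; larger values mean the annotator placed the boundary further from the actual transition.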