3,148 machine learning datasets
Parallel version of annotations in GUM RST v9.1.
RST corpus for Russian.
Diderot’s Encyclopédie is a reference work from 18th-century Europe that aimed to collect the knowledge of its era. This repository hosts an annotated dataset of more than 10,400 Encyclopédie entries with Wikidata identifiers, enabling these entries to be connected to the Wikidata graph. The dataset can serve to train and evaluate named-entity linking systems.
The Guided Lexrank algorithm is applied to the special_appeal.csv dataset to summarize the texts of legal documents. The resulting summary and the topic texts contained in themes.csv are submitted to the BM25 algorithm for similarity assessment. From the list of topics, the GLARE method produces a ranking of suggested topics for a given document.
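The BM25 similarity step above can be sketched with a minimal, self-contained scorer. This is an illustrative implementation of the standard Okapi BM25 formula, not the GLARE authors' code; the toy summary and topic strings are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency of each query term across the candidate topics
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

# Toy example: rank candidate topics against a document summary
summary = "appeal against tax assessment".split()
topics = [t.split() for t in
          ["tax assessment appeal", "criminal sentencing", "labor dispute"]]
scores = bm25_scores(summary, topics)
ranking = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
```

Topics sharing terms with the summary score highest, so the first entry of `ranking` is the best-matching topic for the document.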
VisArgs is a densely annotated benchmark for visual argument understanding. It contains 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction.
RadCases Dataset. This HuggingFace (HF) dataset contains the raw case labels for input patient "one-liner" case summaries according to the ACR Appropriateness Criteria. Because many of the sources used to construct the RadCases dataset require credentialed access, we cannot publicly release the input patient case summaries. Instead, the "cases" included in this publicly available dataset are the cryptographically secure SHA-512 hashes of the original, "human-readable" cases. In this way, the hashes cannot be used to reconstruct the original RadCases dataset, but they can be used as a lookup key to determine the ground-truth label for a case.
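A minimal sketch of how the hash-as-lookup-key scheme could be used, assuming a researcher already holds a human-readable case from a credentialed source. The example case text, label, and table structure are hypothetical, not actual RadCases entries.

```python
import hashlib

def case_key(case_text: str) -> str:
    """SHA-512 hex digest of a human-readable one-liner case summary."""
    return hashlib.sha512(case_text.encode("utf-8")).hexdigest()

# Hypothetical released table: SHA-512 hash -> ground-truth label
label_by_hash = {
    case_key("45F with acute right lower quadrant pain"):
        "CT abdomen/pelvis with IV contrast",
}

def lookup_label(case_text: str):
    """Return the ground-truth label if this case matches a released hash."""
    return label_by_hash.get(case_key(case_text))
```

Because SHA-512 is one-way, holding the hash table alone does not reveal the case texts; only someone who already has a matching case can recover its label.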
Dataset Card for the ACR Appropriateness Criteria Corpus. This dataset contains chunked guidelines and narratives from the ACR Appropriateness Criteria, a set of society guidelines from the American College of Radiology (ACR) that help clinicians order appropriate diagnostic imaging studies for patients. The corpus is formatted similarly to the corpora introduced in MedRAG by Xiong et al. (2024), and can therefore likewise be used for medical Retrieval-Augmented Generation (RAG).
Test-driven benchmark that challenges LLMs to write long JavaScript React applications.
The dataset comprises 1,641 questions and answers generated as three separate parts. The first part contains questions and answers that test the model’s ability to understand the current state of the environment and the provided rules. The second part contains questions that test the model’s ability to generate valid actions, both in terms of syntax (JSON format) and semantics (validity in the specific state). The third part aims to teach fine-tuned models how to make correct decisions given a specific environment state.
A collection of large language model responses to propositional-logic tasks. The responses are annotated according to the following criteria:
GenAIPABench is a specialized dataset designed to evaluate Generative AI-based Privacy Assistants (GenAIPAs). These assistants aim to simplify complex privacy policies and data protection regulations, making them more accessible and understandable to users. The dataset provides a comprehensive framework for assessing the performance of AI models in interpreting and explaining privacy-related documents.
MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical Multimodal Large Language Models (MLLMs) from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are visually dissimilar and clearly distinct to medical experts. Our benchmark consists of 176 confusing pairs. A confusing pair is a set of two images that share the same question and answer options, but for which the correct answer differs between the images. We evaluate models on their ability to answer both questions correctly within a confusing pair, which we call set accuracy. This metric indicates how well models can tell the two images apart: a model that selects the same answer option for both images in every pair receives 0% set accuracy. We also report confusion, a metric that describes the proportion of confusing pairs where the model selects the same answer option for both images.
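The two metrics above can be computed with a short sketch. This is an illustrative reading of the description, not the authors' evaluation code; the pair representation and toy predictions are assumptions.

```python
def set_accuracy(pairs):
    """Fraction of confusing pairs where BOTH images were answered correctly.
    Each pair is ((pred_a, gold_a), (pred_b, gold_b))."""
    correct = sum(1 for (pa, ga), (pb, gb) in pairs if pa == ga and pb == gb)
    return correct / len(pairs)

def confusion(pairs):
    """Fraction of pairs where the model picked the same option for both
    images; the gold answers differ by construction, so picking the same
    option means the model cannot tell the images apart."""
    same = sum(1 for (pa, _), (pb, _) in pairs if pa == pb)
    return same / len(pairs)

# Toy run: two pairs, one answered fully correctly, one confused
pairs = [
    (("A", "A"), ("B", "B")),  # both images correct
    (("A", "A"), ("A", "B")),  # same option picked twice -> confused
]
```

Note the degenerate case the description mentions: a model that always picks the same option per pair gets 100% confusion and 0% set accuracy.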
With its remarkable capability to reach the public instantly, social media has become integral to sharing scholarly articles and measuring public response. This paper analyzes how Twitter bots interact with scholarly articles on the platform. Spamming by bots on social media can steer the conversation and present false public interest in a given piece of research, affecting policies that impact people's lives in the real world. In this paper, we determined whether bots are disseminating a given scholarly article by analyzing the relationship between Twitter bots and several research factors. We developed and tested several supervised machine-learning classification models to tackle this problem. Through our analysis, we also identified that scholarly articles in health and human science are more prone to bot activity than those in other research areas.
WinoPron is a novel dataset of Winogender-like template pairs in English. It fixes inconsistencies in the original Winogender Schemas and contains balanced template pairs for pronoun forms in three grammatical cases, which we find impacts performance and bias evaluation.
IVM-Mix-1M provides over 1M image-instruction pairs with corresponding instruction-relevant mask labels. The dataset consists of three parts: HumanLabelData, RobotMachineData, and VQAMachineData. For HumanLabelData and RobotMachineData, we provide well-organized images, mask labels, and language instructions. For VQAMachineData, we provide only mask labels and language instructions; please refer to https://huggingface.co/datasets/2toINF/IVM-Mix-1M and download the images from the constituent datasets.
A listwise multi-response dataset for human preferences alignment. The dataset is derived from UltraFeedback and SimPO.
This dataset contains 4,606 articles from 1996 to 2024 presented at MIE (Medical Informatics Europe) conferences. The data were extracted from PubMed, and topic extraction and affiliation parsing were performed on them.
This is a dataset of scientific documents derived from arXiv. It comprises 203,961 titles and abstracts categorized into 130 different classes from the arXiv category taxonomy. Each document (title+abstract) is categorized into one or more distinct classes. It is split into train (163,168), validation (20,396), and test (20,397) sets.
TruthGen is a dataset of generated true and false statements, intended for research on truthfulness in reward models and language models, specifically in contexts where political bias is undesirable. This dataset contains 1,987 statement pairs (3,974 statements in total), with each pair containing one objectively true statement and one false statement. It spans a variety of everyday and scientific facts, excluding politically charged topics to the greatest extent possible. The dataset is particularly useful for evaluating reward models trained for alignment with truth, as well as for research on mitigating political bias while improving model accuracy on truth-related tasks.
This is a super-parallel Bible corpus containing 1,401 language labels (language_script pairs), meaning that each verse has a translation in every other language. However, this results in a relatively low number of verses: the current version supports 103 verses per language.