Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

GENTYPES (Gender Stereotypes)

This dataset contains short sentences linking a first name, represented by the template mask [NAME], to stereotypical associations.

1 paper · 0 benchmarks · Texts

Offensive Memes in Singapore Context

This dataset is a collection of memes from existing datasets, online forums, and freshly scraped content. It contains both global-context and Singapore-context memes in different splits. Each meme has a textual description and a label indicating whether it is offensive by Singapore society's standards. It can be used to train content moderation models for a culturally complex society.

1 paper · 0 benchmarks · Images, Texts

MIMIC-IV v2.2

Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. The Medical Information Mart for Intensive Care (MIMIC)-III database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC). Importantly, MIMIC-III was deidentified, and patient identifiers were removed according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-III has been integral in driving large amounts of research in clinical informatics, epidemiology, and machine learning. Here we present MIMIC-IV, an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. MIMIC-IV adopts a modular approach to data organization.

1 paper · 0 benchmarks · Texts

MIMIC-IV-Note (MIMIC-IV-Note: Deidentified free-text clinical notes)

The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 331,794 deidentified discharge summaries from 145,915 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,321,355 deidentified radiology reports for 237,427 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV, providing important context to the clinical data therein.

1 paper · 0 benchmarks · Texts

LMSYS-USP


1 paper · 0 benchmarks · Texts

Olympic 2024

Olympic 2024 is a human-annotated dataset that contains 220 high-quality instances. Each instance consists of an input tuple (user instruction, response 1, confidence of response 1, response 2, confidence of response 2) and an output tuple (evaluation explanation, evaluation result). The evaluation result is either ‘1’ or ‘2’, indicating that response 1 or response 2 is better. To ensure annotation quality, three experts concurrently annotated each data point during the annotation process.

1 paper · 0 benchmarks · Texts
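The input/output tuple structure described above can be sketched as a single record; the JSON-style field names below are hypothetical illustrations, not the dataset's actual schema:

```python
# Hypothetical sketch of one Olympic 2024 instance. Field names are
# assumptions; the released dataset may use a different schema.
instance = {
    "input": {
        "instruction": "Summarize the opening ceremony in one sentence.",
        "response_1": "The ceremony took place along the Seine.",
        "confidence_1": 0.9,   # self-reported confidence of response 1
        "response_2": "The ceremony was held in a stadium.",
        "confidence_2": 0.4,   # self-reported confidence of response 2
    },
    "output": {
        "explanation": "Response 1 is factually accurate; response 2 is not.",
        "result": "1",         # '1' or '2': which response is better
    },
}
```

A pairwise-judgment model would be trained to map the input tuple to the output tuple, with the confidence fields available as extra signals.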

MF (Mathematical Formulas)

Mathematical dataset containing formulas based on the AMPS Khan dataset and the ARQMath dataset V1.3. Based on the retrieved LaTeX formulas, more equivalent versions have been generated by applying randomized LaTeX printing with this SymPy fork. The formulas are intended to be well suited for masked language modeling (MLM). For instance, masking a formula like (a+b)^2 = a^2 + 2ab + b^2 makes sense (e.g., (a+[MASK])^2 = a^2 + [MASK]ab + b[MASK]2 -> the masked tokens are deducible from the context), whereas formulas such as f(x) = 3x+1 are not (e.g., [MASK](x) = 3x[MASK]1 -> the [MASK] tokens are ambiguous).

1 paper · 0 benchmarks · Texts
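The MLM-style masking described above can be sketched in a few lines; `mask_formula` and the token list below are hypothetical illustrations, not the dataset's actual preprocessing pipeline:

```python
import random

def mask_formula(tokens, mask_rate=0.3, seed=0):
    """Randomly replace formula tokens with [MASK], returning the
    masked sequence and the original tokens at the masked positions.
    (Illustrative sketch only; not the dataset authors' code.)"""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)
        else:
            masked.append(tok)
    return masked, targets

# Naive character-level tokenization of (a+b)^2 = a^2 + 2ab + b^2
tokens = ["(", "a", "+", "b", ")", "^", "2", "=",
          "a", "^", "2", "+", "2", "a", "b", "+", "b", "^", "2"]
masked, targets = mask_formula(tokens)
```

An MLM objective then trains the model to predict `targets` at the `[MASK]` positions; for the identity above the surrounding terms constrain the answer, which is exactly the property the dataset description calls "deducible by the context".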

MT (Mathematical Texts)

Mathematical dataset containing mathematical texts, i.e., texts containing LaTeX formulas, based on the AMPS Khan dataset and the ARQMath dataset V1.3. Based on the retrieved LaTeX texts, more mathematically equivalent versions have been generated by applying randomized LaTeX printing with this SymPy fork. A positive id corresponds to the ARQMath post id of the generated text version, a negative id indicates an AMPS text.

1 paper · 0 benchmarks · Texts

NMF (Named Mathematical Formulas)

Mathematical dataset based on 71 famous mathematical identities. Each entry consists of the name of the identity (name), a representation of that identity (formula), a label stating whether the representation belongs to the identity (label), and an id of the mathematical identity (formula_name_id). The false pairs are intentionally challenging, e.g., a^2+2^b=c^2 as a falsified version of the Pythagorean Theorem. All entries have been generated by using data.json as a starting point and applying the randomizing and falsifying algorithms from MathMutator (MAMUT). The formulas in the dataset are not purely mathematical but also contain textual descriptions of the mathematical identity. At most 400,000 versions are generated per identity. There are ten times more falsified versions than true ones, so the dataset can be used for training with different false examples every epoch.

1 paper · 0 benchmarks · Texts

Mathematical Formula Retrieval (MFR (Mathematical Formula Retrieval))

Mathematical dataset based on 71 famous mathematical identities. Each entry consists of two identities (in formula or textual form) together with a label stating whether the two versions describe the same mathematical identity. The false pairs are not randomly chosen but intentionally hard, created by modifying equivalent representations (see ddrg/named_math_formulas for more information). At most 400,000 versions are generated per identity. There are ten times more falsified versions than true ones, so the dataset can be used for training with different false examples every epoch.

1 paper · 0 benchmarks · Texts

NitiBench

A benchmark for legal question answering. The data contains only a test set with two splits: NitiBench-CCL, covering Thai corporate and commercial law, and NitiBench-Tax, containing official tax rulings in Thai scraped from the official Revenue Department website.

1 paper · 0 benchmarks · Texts

WangchanX-Legal-ThaiCCL-RAG

The WangchanX-Legal-ThaiCCL-RAG dataset supports the development of Retrieval-Augmented Generation (RAG) for Thai legal question answering. It allows developers to fine-tune both a retrieval model, to better retrieve relevant law sections, and a Large Language Model (LLM), for instruction tuning. The dataset covers Corporate and Commercial Law (hence the ThaiCCL name).

1 paper · 0 benchmarks · Texts

vpfrc_llm_vulnerability_classifier (VPFRC LLM Vulnerability Classifier Data)

LLM-based vulnerability classification in police narratives. This repository contains datasets used in our research on applying large language models (LLMs) to identify indicators of vulnerability in police incident narratives. These resources support replication of the findings in our paper: "Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives."

1 paper · 0 benchmarks · Texts

simco-comco

The ComCo and SimCo datasets are designed for evaluating multi-object representation in Vision-Language Models (VLMs). They provide controlled environments for analyzing model biases, object recognition, and compositionality in multi-object scenarios.

1 paper · 0 benchmarks · Images, Texts

LegalCore

Recognizing events and their coreferential mentions in a document is essential for understanding the semantic meaning of text. Existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event identification and event coreference resolution, and find that it poses significant challenges for both open-source and proprietary LLMs.

1 paper · 0 benchmarks · Texts

Suicidal Annotation

A large dataset of around 40,000 Reddit posts was collected from r/suicidewatch and other non-suicidal subreddits. Posts collected from r/suicidewatch are annotated as suicidal, and posts collected from a variety of groups such as r/sports, r/anxiety, and r/politics are annotated as non-suicidal. The dataset has been used to train various advanced deep-learning models for a comparative evaluation of these models.

1 paper · 0 benchmarks · Texts

VPData

The largest video inpainting dataset comprises over 390K clips (> 866.7 hours), featuring precise masks and detailed video captions.

1 paper · 0 benchmarks · RGB Video, Texts, Tracking, Videos

VPBench

The benchmark for VPData, the largest video inpainting dataset, which comprises over 390K clips (> 866.7 hours) and features precise masks and detailed video captions.

1 paper · 0 benchmarks · RGB Video, Texts, Tracking, Videos

IMPACT Patent (A Large-scale Integrated Multimodal Patent Analysis and Creation Dataset for Design Patents)

A large-scale multimodal patent dataset with detailed captions for design patent figures.

1 paper · 1 benchmark · Images, Texts

AI Conversational Interviewing: Interview data

Replication material. This record contains the necessary materials and instructions to replicate the findings presented in our paper. We provide comprehensive information on the data sources, code, and analytical procedures used in our study. The replication package includes raw data files, data-cleaning scripts, and analysis code. We encourage users to contact us with any questions or issues encountered during replication.

1 paper · 0 benchmarks · Texts
Page 147 of 158