PTVD is a plot-oriented multimodal dataset in the TV domain. It is also the first non-English dataset of its kind. Additionally, PTVD contains more than 26 million bullet screen comments (BSCs), powering large-scale pre-training.
The UTRSet-Real dataset is a comprehensive, manually annotated dataset specifically curated for printed Urdu OCR research. It contains over 11,000 printed text line images, each of which has been meticulously annotated. One of the standout features of this dataset is its remarkable diversity, which includes variations in fonts, text sizes, colours, orientations, lighting conditions, noise levels, styles, and backgrounds. This diversity closely mirrors real-world scenarios, making the dataset highly suitable for training and evaluating models that aim to excel in real-world Urdu text recognition tasks.
The UTRSet-Synth dataset is introduced as a complementary training resource to the UTRSet-Real Dataset, specifically designed to enhance the effectiveness of Urdu OCR models. It is a high-quality synthetic dataset comprising 20,000 lines that closely resemble real-world representations of Urdu text.
The UrduDoc Dataset is a benchmark dataset for Urdu text line detection in scanned documents. It is created as a byproduct of the UTRSet-Real dataset generation process. Comprising 478 diverse images collected from various sources such as books, documents, manuscripts, and newspapers, it offers a valuable resource for research in Urdu document analysis. It includes 358 pages for training and 120 pages for validation, featuring a wide range of styles, scales, and lighting conditions. It serves as a benchmark for evaluating printed Urdu text detection models, and the benchmark results of state-of-the-art models are provided. The Contour-Net model demonstrates the best performance in terms of h-mean.
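The h-mean used to rank the detection models above is the harmonic mean of precision and recall (the standard F1-style detection metric). A minimal sketch, assuming only that h-mean is defined in the usual way:

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as used to rank
    text detection models on benchmarks like UrduDoc."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean penalizes imbalance: a detector with precision 0.9 but recall 0.3 scores far lower than one with 0.6 on both.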
The Dissonance Twitter Dataset is a collection of tweets annotated for dissonance.
It consists of 32x32 pixel images of shapes with multiple attributes (size, location, rotation, color). Each image is also paired with its ground truth information (attributes), and a natural language description (English) of the image.
The dataset provides system prompts and user prompts for an assistant. The intended use is to sample random prompt pairs and compute human preference via A/B testing, scoring both system-prompt obedience and user-prompt relevance.
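The pairing step described above can be sketched as follows. This is a hypothetical helper (the dataset does not prescribe an implementation): it shuffles the prompts with a fixed seed and yields disjoint pairs for annotators to compare.

```python
import random

def make_ab_pairs(prompts, seed=0):
    """Randomly pair prompts for A/B preference comparison.
    Hypothetical helper; the pairing scheme is an assumption.
    If the count is odd, the leftover prompt is dropped."""
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1])
            for i in range(0, len(shuffled) - 1, 2)]
```

Each pair would then be shown to human raters twice, once judged for system-prompt obedience and once for user-prompt relevance.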
This is the replication data for the paper: "Crossing the Linguistic Causeway: Ethnonational Differences on Soundscape Attributes in Bahasa Melayu".
ConvSumX is a cross-lingual conversation summarization benchmark built with a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, each covering 3 language directions.
This repository contains replication data to the paper titled: "Anti-noise window: subjective perception of active noise reduction and effect of informational masking"
COLLIE-v1 is a dataset of 2,080 instances spanning 13 constraint structures, designed for text generation under constraints. It is built with COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints at diverse generation levels (word, sentence, paragraph, passage).
Internet Archive Scholar Reference Dataset.
CoverageEval is a dataset specifically designed for evaluating LLMs on code coverage prediction. To create CoverageEval, we parse the code coverage logs generated during the execution of the test cases, which enables us to extract the relevant coverage annotations. We then carefully structure and export the dataset in a format that facilitates its use and evaluation by researchers and practitioners alike.
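The parsing step above can be sketched as follows. The annotation format here is an assumption for illustration (a per-line mark, `>` for executed and anything else for not executed); CoverageEval's actual log format may differ:

```python
def parse_coverage(annotated_lines):
    """Split annotated source lines into covered and uncovered line numbers.
    Assumes a hypothetical format where each line starts with a coverage
    mark: '>' means the line was executed during the tests."""
    covered, uncovered = [], []
    for lineno, line in enumerate(annotated_lines, start=1):
        if line[:1] == ">":
            covered.append(lineno)
        else:
            uncovered.append(lineno)
    return covered, uncovered
```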
DEplain-APA-doc: A German Parallel Corpus for Document Simplification on News Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
DEplain-web-doc: A German Parallel Corpus for Document Simplification on Web Texts. DEplain is a new dataset of parallel, professionally written and manually aligned simplifications in plain German “plain DE” (or in German: “Einfache Sprache”). DEplain consists of four main subcorpora: DEplain-APA-doc, DEplain-APA-sent, DEplain-web-doc, and DEplain-web-sent.
We present a comprehensive dataset comprising a vast collection of raw mineral samples for the purpose of mineral recognition. The dataset encompasses more than 5,000 distinct mineral species and incorporates subsets for zero-shot and few-shot learning. In addition to the samples themselves, some entries in the dataset are accompanied by supplementary natural language descriptions, size measurements, and segmentation masks. For detailed information on each sample, please refer to the minerals_full.csv file.
Three tasks were addressed in the LLMs4OL paradigm. The datasets released address the three tasks respectively. They are as follows:
Cybersecurity education is exceptionally challenging, as it involves learning complex attacks and tools while developing the critical problem-solving skills needed to defend systems. For a student or novice researcher in the cybersecurity domain, there is a need to design an adaptive learning strategy that can break complex tasks and concepts into simple representations. An AI-enabled automated cybersecurity education system can improve cognitive engagement and active learning. Knowledge graphs (KG) provide a visual representation that can reason over and interpret the underlying data, making them suitable for use in education and interactive learning. However, there are no publicly available datasets for the cybersecurity education domain to build such systems. The data is present as unstructured educational course material, Wiki pages, capture the flag (CTF) writeups, etc. Creating knowledge graphs from unstructured text is challenging without an ontology or annotated dataset.
This dataset contains 6,387 ChatGPT prompts collected from four platforms (Reddit, Discord, websites, and open-source datasets) between December 2022 and May 2023. Among these prompts, 666 jailbreak prompts are identified.
SDoH Human Annotated Demographic Robustness (SHADR) Dataset. Social determinants of health (SDoH) play a pivotal role in determining patient outcomes. However, their documentation in electronic health records (EHR) remains incomplete. This dataset was created from a study examining the capability of large language models in extracting SDoH from the free-text sections of EHRs. Furthermore, the study delved into the potential of synthetic clinical text to bolster the extraction of these scarcely documented, yet crucial, clinical data.