3,148 machine learning datasets
A test dataset of articles from 2020, annotated following the CoNLL-2003 NER task.
145k natural language and PDDL problem pairs from the Blocks World, Gripper, and Floor Tile domains.
Text-Vision Cross-Modal Place Recognition Dataset
We collect a dataset of Rich Human Feedback on 18K images (RichHF-18K), which contains (i) point annotations on the image that highlight regions of implausibility/artifacts and text-image misalignment; (ii) labeled words in the prompts specifying the missing or misrepresented concepts in the generated image; and (iii) four types of fine-grained scores for image plausibility, text-image alignment, aesthetics, and overall rating.
The VerilogEval Dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) to generate syntactically correct and functionally accurate Verilog code. Introduced in the paper "VerilogEval: Evaluating Large Language Models for Verilog Code Generation," it has become a cornerstone for research in hardware code generation.
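As a rough sketch of how such a benchmark is consumed (the file name, field names, and `generate_verilog` stub below are hypothetical assumptions, not the official harness; the real evaluation in the VerilogEval repository scores candidates by simulating them against reference testbenches):

```python
import json

def generate_verilog(prompt: str) -> str:
    # Placeholder for an LLM call; a real harness would query a model here.
    return "module stub(); endmodule"

# Hypothetical JSONL layout: one problem per line, each with a task id and
# a prompt describing the target hardware module.
with open("verilog_eval_problems.jsonl") as f:
    for line in f:
        problem = json.loads(line)
        candidate = generate_verilog(problem["prompt"])
        # Syntactic correctness can be checked by compiling the candidate;
        # functional accuracy requires simulation against a testbench.
        print(problem["task_id"], "->", len(candidate), "chars generated")
```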
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models for fine-grained omni-modal video understanding.
M2QA (Multi-domain Multilingual Question Answering) is an extractive question answering benchmark for evaluating joint language and domain transfer. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. 40% of the questions are unanswerable; 60% are answerable.
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages.
CLEAR-Bias is a benchmark dataset designed to evaluate the robustness of large language models (LLMs) against bias elicitation, particularly under adversarial conditions. It comprises 4,400 prompts across two task formats: multiple-choice and sentence completion. These prompts span seven core bias categories (age, disability, ethnicity, gender, religion, sexual orientation, and socioeconomic status) as well as three intersectional categories, enabling the exploration of overlapping social biases often overlooked in standard evaluations. Each category includes 20 carefully crafted base prompts (10 per task type), which are further expanded using seven jailbreak techniques: machine translation, obfuscation, prefix injection, prompt injection, refusal suppression, reward incentives, and role-playing, each implemented with three variants.
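The 4,400 total is consistent with these figures: the 10 categories (7 core + 3 intersectional) contribute 10 × 20 = 200 base prompts, and each base prompt plus its 7 × 3 = 21 jailbreak variants yields 22 prompts, giving 200 × 22 = 4,400.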
https://github.com/dialogue-evaluation/RuSentNE-evaluation
Dataset of the Beacon3D benchmark, introduced in "Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis."
We introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements collected from 1,154 human participants. We design a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison.
Named entities in Bavarian text
QUITE (Quantifying Uncertainty in natural language Text) is an entirely new benchmark for assessing the capabilities of neural language model-based systems with respect to Bayesian reasoning, over a large set of input texts that describe probabilistic relationships in natural language.
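To illustrate the kind of Bayesian reasoning QUITE targets, here is a minimal worked example (the scenario and numbers are hypothetical, not drawn from the benchmark): a text states a prior and two likelihoods, and the system must compute the posterior via Bayes' rule.

```python
# Hypothetical QUITE-style reasoning problem; the probabilities below are
# illustrative and not taken from the actual benchmark.
p_rain = 0.3                 # prior:      P(rain)
p_wet_given_rain = 0.9       # likelihood: P(wet lawn | rain)
p_wet_given_dry = 0.1        # likelihood: P(wet lawn | no rain)

# Marginal probability of observing a wet lawn.
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Bayes' rule gives the posterior the system must report.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(f"P(rain | wet lawn) = {p_rain_given_wet:.3f}")  # ≈ 0.794
```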
Paper Field is built from the Microsoft Academic Graph and maps paper titles to one of 7 fields of study. Each field of study - geography, politics, economics, business, sociology, medicine, and psychology - has approximately 12K training examples.
The Helsinki Prosody Corpus is a dataset for predicting prosodic prominence from written text. The annotations are automatically generated, high-quality prosodic labels for the 'clean' subsets of the LibriTTS corpus (Zen et al., 2019), comprising 262.5 hours of read speech from 1,230 speakers. The transcribed sentences were aligned and then prosodically annotated with word-level acoustic prominence labels.
An artificial corpus built using grammatical dependency rules, created in response to the lack of resources for Sign Language.
A fact-based text editing dataset built on the WebNLG dataset.