3,148 machine learning datasets
A test dataset of articles from 2020, annotated following the CoNLL-2003 NER task.
145k natural language and PDDL problem pairs from the Blocks World, Gripper, and Floor Tile domains.
Text-Vision Cross-Modal Place Recognition Dataset
We collect a dataset of Rich Human Feedback on 18K images (RichHF-18K), which contains (i) point annotations on the image that highlight regions of implausibility/artifacts and text-image misalignment; (ii) labeled words in the prompts specifying the missing or misrepresented concepts in the generated image; and (iii) four types of fine-grained scores for image plausibility, text-image alignment, aesthetics, and overall rating.
The VerilogEval Dataset is a benchmark specifically designed to assess the ability of large language models (LLMs) to generate syntactically correct and functionally accurate Verilog code. Introduced in the paper "VerilogEval: Evaluating Large Language Models for Verilog Code Generation," it has become a cornerstone for research in hardware code generation.
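As a rough sketch of how such a benchmark is consumed (the file name, field names, and `generate_verilog` stub below are hypothetical assumptions, not the official harness; the real evaluation in the VerilogEval repository scores candidates by simulating them against reference testbenches):

```python
import json

def generate_verilog(prompt: str) -> str:
    # Placeholder for an LLM call; a real harness would query a model here.
    return "module stub(); endmodule"

# Hypothetical JSONL layout: one problem per line, each with a task id and
# a prompt describing the target hardware module.
with open("verilog_eval_problems.jsonl") as f:
    for line in f:
        problem = json.loads(line)
        candidate = generate_verilog(problem["prompt"])
        # Syntactic correctness can be checked by compiling the candidate;
        # functional accuracy requires simulation against a testbench.
        print(problem["task_id"], "->", len(candidate), "chars generated")
```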
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models for fine-grained omni-modal video understanding.
M2QA (Multi-domain Multilingual Question Answering) is an extractive question answering benchmark for evaluating joint language and domain transfer. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. 40% of the questions are unanswerable; 60% are answerable.
The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded, both in size and languages, many rely on translations of English datasets, failing to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages.
CLEAR-Bias is a benchmark dataset designed to evaluate the robustness of large language models (LLMs) against bias elicitation, particularly under adversarial conditions. It comprises 4,400 prompts across two task formats: multiple-choice and sentence completion. These prompts span seven core bias categories (age, disability, ethnicity, gender, religion, sexual orientation, and socioeconomic status) as well as three intersectional categories, enabling the exploration of overlapping social biases often overlooked in standard evaluations. Each category includes 20 carefully crafted base prompts (10 per task type), which are further expanded using seven jailbreak techniques: machine translation, obfuscation, prefix injection, prompt injection, refusal suppression, reward incentives, and role-playing, each implemented with three variants.
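The 4,400 total is consistent with these figures: the 10 categories (7 core + 3 intersectional) contribute 10 × 20 = 200 base prompts, and each base prompt plus its 7 × 3 = 21 jailbreak variants yields 22 prompts, giving 200 × 22 = 4,400.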
https://github.com/dialogue-evaluation/RuSentNE-evaluation
Dataset of the Beacon3D benchmark, introduced in "Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis."
We introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements collected from 1,154 human participants. We design a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison.
Named entities in Bavarian text
QUITE (Quantifying Uncertainty in natural language Text) is an entirely new benchmark for assessing the capabilities of neural language model-based systems with respect to Bayesian reasoning, over a large set of input texts that describe probabilistic relationships in natural language.
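To illustrate the kind of Bayesian reasoning QUITE targets, here is a minimal worked example (the scenario and numbers are hypothetical, not drawn from the benchmark): a text states a prior and two likelihoods, and the system must compute the posterior via Bayes' rule.

```python
# Hypothetical QUITE-style reasoning problem; the probabilities below are
# illustrative and not taken from the actual benchmark.
p_rain = 0.3                 # prior:      P(rain)
p_wet_given_rain = 0.9       # likelihood: P(wet lawn | rain)
p_wet_given_dry = 0.1        # likelihood: P(wet lawn | no rain)

# Marginal probability of observing a wet lawn.
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Bayes' rule gives the posterior the system must report.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(f"P(rain | wet lawn) = {p_rain_given_wet:.3f}")  # ≈ 0.794
```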
Paper Field is built from the Microsoft Academic Graph and maps paper titles to one of 7 fields of study. Each field of study - geography, politics, economics, business, sociology, medicine, and psychology - has approximately 12K training examples.
The Helsinki Prosody Corpus is a dataset for predicting prosodic prominence from written text. The annotations are automatically generated, high-quality prosodic labels for the 'clean' subsets of the LibriTTS corpus (Zen et al., 2019), comprising 262.5 hours of read speech from 1,230 speakers. The transcribed sentences were aligned and then prosodically annotated with word-level acoustic prominence labels.
An artificial corpus built using grammatical dependency rules, created in response to the lack of resources for Sign Language.
A fact-based text editing dataset built on the WebNLG dataset.