3,148 machine learning datasets
We introduce FortisAVQA, a dataset designed to assess the robustness of audio-visual question answering (AVQA) models. Its construction involves two key processes: rephrasing and splitting. Rephrasing modifies questions from the MUSIC-AVQA test set to increase linguistic diversity, reducing models' reliance on spurious correlations between key question terms and answers. Splitting automatically partitions questions into frequent (head) and rare (tail) subsets, enabling a more comprehensive evaluation of model performance in both in-distribution and out-of-distribution scenarios.
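The head/tail splitting idea can be illustrated with a minimal frequency-based partition. This is a sketch under assumed conventions (template strings, a simple rank cutoff), not FortisAVQA's actual procedure:

```python
from collections import Counter

def split_head_tail(question_templates, head_fraction=0.5):
    """Partition question templates into frequent (head) and rare (tail)
    subsets by occurrence count. The rank-cutoff rule and head_fraction
    value are illustrative assumptions, not the dataset's exact method."""
    counts = Counter(question_templates)
    # Rank templates from most to least frequent; the top slice is the head.
    ranked = [t for t, _ in counts.most_common()]
    cutoff = max(1, int(len(ranked) * head_fraction))
    head = set(ranked[:cutoff])
    tail = set(ranked[cutoff:])
    return head, tail

# Toy example: "how many" questions dominate, so they land in the head.
templates = ["how many", "how many", "how many", "which", "which", "is there"]
head, tail = split_head_tail(templates)
```

Questions whose template falls in the tail set then form the out-of-distribution evaluation split.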
The Storytelling Video Dataset is a high-quality, human-reviewed multimodal dataset featuring over 700 full-body video recordings of native Russian speakers. Each video is 10+ minutes long and includes synchronized speech, facial expressions, gestures, and emotional variation, making the dataset well suited to multimodal research and development.
Artificial Relationships in Fiction (ARF) is a synthetically annotated dataset for Relation Extraction (RE) in fiction, created from a curated selection of literary texts sourced from Project Gutenberg. The dataset captures the rich, implicit relationships within fictional narratives using a novel ontology and GPT-4o for annotation. ARF is the first large-scale RE resource designed specifically for literary texts, advancing both NLP model training and computational literary analysis.
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse, human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources spanning math, wiki, arXiv, and code. Although these data are mostly drawn from well-processed datasets, quality standards vary across domains and would otherwise require extensive expert experience and labor to enforce. We therefore propose a novel, largely model-based pipeline for assessing data quality, built on two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We also extract 4.5 billion CoT (Chain-of-Thought) templates, collectively named CCI4.0-M2-CoT. Unlike distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse
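A two-stage deduplication step of the kind described could be sketched as exact dedup by content hash followed by near-duplicate removal via shingle overlap. The shingle size, Jaccard threshold, and quadratic comparison loop are all illustrative assumptions, not the CCI4.0 pipeline's actual parameters:

```python
import hashlib

def dedup_two_stage(docs, shingle_size=5, jaccard_threshold=0.8):
    """Sketch of two-stage deduplication: (1) drop byte-identical
    documents via SHA-256 hashing, (2) drop near-duplicates whose
    word-shingle sets overlap heavily with an already-kept document.
    Real large-scale pipelines use MinHash/LSH instead of the
    quadratic comparison shown here."""
    # Stage 1: exact deduplication.
    seen_hashes, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            unique.append(doc)

    # Stage 2: near-duplicate removal by shingle Jaccard similarity.
    def shingles(text):
        words = text.split()
        return {tuple(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}

    kept, kept_shingles = [], []
    for doc in unique:
        s = shingles(doc)
        is_near_dup = any(
            len(s & t) / max(1, len(s | t)) >= jaccard_threshold
            for t in kept_shingles
        )
        if not is_near_dup:
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

At multi-terabyte scale the second stage would be replaced by locality-sensitive hashing so that candidate pairs, not all pairs, are compared.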
Dataset Card for SENTINEL:<br> Mitigating Object Hallucinations via Sentence-Level Early Intervention <!-- omit in toc --> <a href='https://arxiv.org/abs/2507.12455'> <img src='https://img.shields.io/badge/Paper-Arxiv-purple'></a> <a href='https://github.com/pspdada/SENTINEL'> <img src='https://img.shields.io/badge/Github-Repo-Green'></a>
NHR-Edit is a training dataset for instruction-based image editing. Each sample consists of an input image, a natural language editing instruction, and the corresponding edited image. All samples are generated fully automatically using the NoHumanRequired pipeline, without any human annotation or filtering.
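A sample of this shape can be modeled as a small record type. The field names below are illustrative assumptions about the schema, not the dataset's documented column names:

```python
from dataclasses import dataclass

@dataclass
class EditSample:
    """One instruction-based image-editing example in the style of
    NHR-Edit. Field names are hypothetical; consult the dataset card
    for the actual schema."""
    input_image: bytes   # the source image, e.g. encoded PNG/JPEG bytes
    instruction: str     # natural-language editing instruction
    edited_image: bytes  # the target image after applying the edit

# Toy instance with placeholder image bytes.
sample = EditSample(b"<png bytes>", "make the sky purple", b"<png bytes>")
```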
COCO-Facet is a benchmark for attribute-focused text-to-image retrieval, comprising 9,112 queries, each paired with 100 candidate images. The images are drawn from COCO, and the query annotations come from existing annotation resources built on COCO images (COCO, Visual7W, VisDial, COCO-Stuff).
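With a fixed 100-candidate pool per query, a natural evaluation is recall@k over the per-query rankings. This is a generic retrieval-metric sketch; the benchmark's official evaluation protocol may differ:

```python
def recall_at_k(ranked_candidates, relevant, k):
    """Fraction of queries whose relevant image appears in the top-k
    of its candidate ranking. `ranked_candidates` is one ranked list
    of image IDs per query; `relevant` is the gold image ID per query."""
    hits = sum(1 for ranking, gold in zip(ranked_candidates, relevant)
               if gold in ranking[:k])
    return hits / len(ranked_candidates)

# Toy example with two queries and three ranked candidates each.
rankings = [["img3", "img1", "img7"], ["img5", "img2", "img9"]]
gold = ["img1", "img9"]
```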
ArcBench is a logically challenging dataset of 158 English question–answer pairs, derived from the RoR-Bench benchmark. It targets deductive and multi-step reasoning in LLMs and multi-agent systems. The dataset was curated by translating original riddles into accessible English, removing multi-modal complexities, and validating each pair through automated reasoning workflows (e.g., Nexus Architect). ArcBench supports evaluation of reasoning performance, workflow refinement with feedback loops, comparative analysis of language models, and prompt-engineering research.