3,148 machine learning datasets
An evaluation dataset for planning with LLM agents
MapEval contains 700 question-answer pairs.
MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer the question based on the factual data provided in the context.
MapEval-Visual contains 400 image-question-answer triplets. Each question is paired with a snapshot from the Google Maps website. The task is to answer the question based on the provided map snapshot.
MapEval-API contains 300 question-answer pairs. The task is to answer the question by fetching the necessary information using external Map APIs.
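As a rough illustration of how a single MapEval example might be represented in code, the minimal Python sketch below uses illustrative field names (question, answer, context, snapshot_path); it is an assumption for exposition, not the official schema or loader.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MapEvalExample:
    """One MapEval question-answer pair (field names are illustrative, not the official schema)."""
    question: str
    answer: str
    context: Optional[str] = None        # geo-spatial facts given in text (MapEval-Textual)
    snapshot_path: Optional[str] = None  # Google Maps snapshot image (MapEval-Visual)

# The three subsets described above, keyed by how the required information is supplied.
subsets = {
    "textual": "answer from geo-spatial facts provided in the context",
    "api":     "answer by fetching information through external Map APIs",
    "visual":  "answer from a Google Maps snapshot",
}
```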
This is the official dataset for PRMBench. PRMBench is a benchmark dataset for evaluating process-level reward models (PRMs). It consists of 6,216 data instances, each containing a question, a solution process, and a modified process with errors. The dataset is designed to evaluate the ability of PRMs to identify fine-grained error types in the solution process. The dataset is annotated with error types and reasons for the errors, providing a comprehensive evaluation of PRMs.
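The toy Python sketch below illustrates what such an instance might look like and how step-level error detections could be scored; all keys and the scoring function are assumptions for illustration, not the released schema or the official PRMBench metric.

```python
# A toy PRMBench-style instance; keys are illustrative only.
instance = {
    "question": "What is 17 * 24?",
    "solution_steps": ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68", "= 408"],
    "modified_steps": ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 58", "= 398"],  # injected error
    "error_step_indices": [1],
    "error_types": ["calculation_error"],
    "error_reasons": ["17 * 4 is 68, not 58."],
}

def step_accuracy(predicted: list[int], gold: list[int], n_steps: int) -> float:
    """Toy per-step accuracy for a PRM that flags erroneous steps (not the official metric)."""
    pred, gold_set = set(predicted), set(gold)
    return sum((i in pred) == (i in gold_set) for i in range(n_steps)) / n_steps

print(step_accuracy([1], instance["error_step_indices"], len(instance["modified_steps"])))
```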
SPIQA (Scientific Paper Image Question Answering)
Translated SNLI Dataset in Marathi: a Marathi translation of the SNLI dataset, designed for Semantic Textual Similarity (STS) tasks. The translations were generated using the model aryaumesh/english-to-marathi.
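A minimal sketch of reproducing the translation step with the named checkpoint via the Hugging Face transformers translation pipeline is shown below; the pipeline settings and example sentences are assumptions, not the authors' exact procedure.

```python
from transformers import pipeline

# The aryaumesh/english-to-marathi checkpoint is named in the card; the rest of this
# setup is an assumption for illustration.
translator = pipeline("translation", model="aryaumesh/english-to-marathi")

premise = "A man is playing a guitar on stage."
hypothesis = "Someone is performing music."

mr_premise = translator(premise)[0]["translation_text"]
mr_hypothesis = translator(hypothesis)[0]["translation_text"]
print(mr_premise, mr_hypothesis, sep="\n")
```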
Source: Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records
This is a dataset for 3-way sentiment classification of reviews (negative, neutral, positive). It is a merge of Stanford Sentiment Treebank (SST-3) and DynaSent Rounds 1 and 2, licensed under Apache 2.0 and Creative Commons Attribution 4.0 respectively. The SST-3, DynaSent R1, and DynaSent R2 datasets were randomly mixed to form a new dataset with 102,097 Train examples, 5,421 Validation examples, and 6,530 Test examples. See Table 1 for the distribution of labels within this merged dataset.
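A minimal pandas sketch of how such a random merge into train/validation/test splits could be performed is shown below; the file names and column layout are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: each frame has columns ["text", "label"] with labels in
# {"negative", "neutral", "positive"}; file names are placeholders.
frames = [pd.read_csv(f) for f in ("sst3.csv", "dynasent_r1.csv", "dynasent_r2.csv")]
merged = pd.concat(frames, ignore_index=True)

# Shuffle, then carve out splits of roughly the sizes reported above.
merged = merged.sample(frac=1.0, random_state=0).reset_index(drop=True)
train = merged.iloc[:102_097]
validation = merged.iloc[102_097:102_097 + 5_421]
test = merged.iloc[102_097 + 5_421:]
```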
This dataset comprises 77,175 Reddit posts from 115 subreddit forums, annotated for the presence of 15 topics related to eating disorders and dieting. The dataset includes labels and scores on all 77,175 Reddit posts, determined by 5 Large Language Models: GPT-4o, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Vicuna-7b-v1.5, as well as by the ensemble of the four open-source LLMs. The dataset also includes a subset of 1,080 human-annotated posts for evaluation.
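One plausible way to derive the ensemble label from the four open-source models is a simple per-topic majority vote, sketched below; the column names and the exact aggregation rule are assumptions, not necessarily what the dataset authors used.

```python
import pandas as pd

# Illustrative columns: one binary label per topic from each of the four open-source models.
open_source_models = ["llama", "qwen", "mistral", "vicuna"]

def ensemble_label(row: pd.Series, topic: str) -> int:
    """Majority vote over the four open-source models for one topic (illustrative rule)."""
    votes = sum(row[f"{model}_{topic}"] for model in open_source_models)
    return int(votes >= 3)  # e.g. require agreement from at least 3 of the 4 models

# df = pd.read_csv("reddit_posts_labels.csv")  # hypothetical file
# df["binge_eating_ensemble"] = df.apply(lambda r: ensemble_label(r, "binge_eating"), axis=1)
```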
This dataset is a translation of the MS MARCO dataset, making it the first large-scale Urdu IR dataset.
Dataset for testing the ability of Vision-Language Models (VLMs) to recognize and match 3D objects of exactly the same 3D shape but with different orientations, materials, textures, environments, and lighting conditions.
M²ConceptBase is a concept-centric multimodal knowledge base designed to bridge the gap between visual and linguistic semantics. It features 951K images and 152K concepts, with each concept linked to an average of 6.27 images and a detailed textual description.
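A toy example of what a single concept entry might look like is sketched below; the keys are illustrative assumptions, not the released schema.

```python
# A toy M²ConceptBase-style entry; keys and paths are placeholders.
concept_entry = {
    "concept": "red panda",
    "description": "A small arboreal mammal native to the eastern Himalayas ...",
    "images": [
        "images/red_panda_0001.jpg",
        "images/red_panda_0002.jpg",
    ],  # each concept links to about 6.27 images on average in the full knowledge base
}
```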
JustLogic is a natural language deductive reasoning dataset. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy.
Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with TimelineKGQA, a universal temporal QA generator applicable to any TKGs. The code is available at https://github.com/PascalSun/TimelineKGQA as an open-source Python package.
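The small self-contained sketch below illustrates the kind of timeline-based QA generation the package automates, using a toy temporal fact; the types and question template are illustrative, not the package's actual API.

```python
from collections import namedtuple

# A toy temporal fact (subject, relation, object, start, end); not the package's own types.
TemporalFact = namedtuple("TemporalFact", "subject relation obj start end")

fact = TemporalFact("Alice", "worked_for", "Acme Corp", "2015-03", "2019-07")

def simple_qa(f: TemporalFact) -> tuple[str, str]:
    """Generate one timeline-position question from a single fact."""
    question = f"During which period did {f.subject} {f.relation.replace('_', ' ')} {f.obj}?"
    answer = f"{f.start} to {f.end}"
    return question, answer

print(simple_qa(fact))
```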