3,148 machine learning datasets
An evaluation dataset for planning with LLM agents
MapEval contains 700 question-answer pairs.
MapEval-Textual contains 300 context-question-answer triplets. The necessary geo-spatial information is provided in the context. The task is to answer the question based on the factual data provided in the context.
MapEval-Visual contains 400 image-question-answer triplets. Each question is paired with a snapshot from the Google Maps website. The task is to answer the question based on the provided map snapshot.
MapEval-API contains 300 question-answer pairs. The task is to answer the question by fetching the necessary information using external Map APIs.
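As a rough illustration of how a single MapEval example might be represented in code, the minimal Python sketch below uses illustrative field names (question, answer, context, snapshot_path); it is an assumption for exposition, not the official schema or loader.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MapEvalExample:
    """One MapEval question-answer pair (field names are illustrative, not the official schema)."""
    question: str
    answer: str
    context: Optional[str] = None        # geo-spatial facts given in text (MapEval-Textual)
    snapshot_path: Optional[str] = None  # Google Maps snapshot image (MapEval-Visual)

# The three subsets described above, keyed by how the required information is supplied.
subsets = {
    "textual": "answer from geo-spatial facts provided in the context",
    "api":     "answer by fetching information through external Map APIs",
    "visual":  "answer from a Google Maps snapshot",
}
```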
This is the official dataset for PRMBench. PRMBench is a benchmark dataset for evaluating process-level reward models (PRMs). It consists of 6,216 data instances, each containing a question, a solution process, and a modified process with errors. The dataset is designed to evaluate the ability of PRMs to identify fine-grained error types in the solution process. The dataset is annotated with error types and reasons for the errors, providing a comprehensive evaluation of PRMs.
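The toy Python sketch below illustrates what such an instance might look like and how step-level error detections could be scored; all keys and the scoring function are assumptions for illustration, not the released schema or the official PRMBench metric.

```python
# A toy PRMBench-style instance; keys are illustrative only.
instance = {
    "question": "What is 17 * 24?",
    "solution_steps": ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68", "= 408"],
    "modified_steps": ["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 58", "= 398"],  # injected error
    "error_step_indices": [1],
    "error_types": ["calculation_error"],
    "error_reasons": ["17 * 4 is 68, not 58."],
}

def step_accuracy(predicted: list[int], gold: list[int], n_steps: int) -> float:
    """Toy per-step accuracy for a PRM that flags erroneous steps (not the official metric)."""
    pred, gold_set = set(predicted), set(gold)
    return sum((i in pred) == (i in gold_set) for i in range(n_steps)) / n_steps

print(step_accuracy([1], instance["error_step_indices"], len(instance["modified_steps"])))
```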
SPIQA (Scientific Paper Image Question Answering)
Translated SNLI Dataset in Marathi: a Marathi translation of the SNLI dataset, designed for Semantic Textual Similarity (STS) tasks. The translations were generated using the model aryaumesh/english-to-marathi.
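A minimal sketch of reproducing the translation step with the named checkpoint via the Hugging Face transformers translation pipeline is shown below; the pipeline settings and example sentences are assumptions, not the authors' exact procedure.

```python
from transformers import pipeline

# The aryaumesh/english-to-marathi checkpoint is named in the card; the rest of this
# setup is an assumption for illustration.
translator = pipeline("translation", model="aryaumesh/english-to-marathi")

premise = "A man is playing a guitar on stage."
hypothesis = "Someone is performing music."

mr_premise = translator(premise)[0]["translation_text"]
mr_hypothesis = translator(hypothesis)[0]["translation_text"]
print(mr_premise, mr_hypothesis, sep="\n")
```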
Source: Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records
This is a dataset for 3-way sentiment classification of reviews (negative, neutral, positive). It is a merge of Stanford Sentiment Treebank (SST-3) and DynaSent Rounds 1 and 2, licensed under Apache 2.0 and Creative Commons Attribution 4.0 respectively. The SST-3, DynaSent R1, and DynaSent R2 datasets were randomly mixed to form a new dataset with 102,097 Train examples, 5,421 Validation examples, and 6,530 Test examples. See Table 1 for the distribution of labels within this merged dataset.
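A minimal pandas sketch of how such a random merge into train/validation/test splits could be performed is shown below; the file names and column layout are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: each frame has columns ["text", "label"] with labels in
# {"negative", "neutral", "positive"}; file names are placeholders.
frames = [pd.read_csv(f) for f in ("sst3.csv", "dynasent_r1.csv", "dynasent_r2.csv")]
merged = pd.concat(frames, ignore_index=True)

# Shuffle, then carve out splits of roughly the sizes reported above.
merged = merged.sample(frac=1.0, random_state=0).reset_index(drop=True)
train = merged.iloc[:102_097]
validation = merged.iloc[102_097:102_097 + 5_421]
test = merged.iloc[102_097 + 5_421:]
```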
This dataset comprises 77,175 Reddit posts from 115 subreddit forums, annotated for the presence of 15 topics related to eating disorders and dieting. The dataset includes labels and scores on all 77,175 Reddit posts, determined by 5 Large Language Models: GPT-4o, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Vicuna-7b-v1.5, as well as by the ensemble of the four open-source LLMs. The dataset also includes a subset of 1,080 human-annotated posts for evaluation.
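One plausible way to derive the ensemble label from the four open-source models is a simple per-topic majority vote, sketched below; the column names and the exact aggregation rule are assumptions, not necessarily what the dataset authors used.

```python
import pandas as pd

# Illustrative columns: one binary label per topic from each of the four open-source models.
open_source_models = ["llama", "qwen", "mistral", "vicuna"]

def ensemble_label(row: pd.Series, topic: str) -> int:
    """Majority vote over the four open-source models for one topic (illustrative rule)."""
    votes = sum(row[f"{model}_{topic}"] for model in open_source_models)
    return int(votes >= 3)  # e.g. require agreement from at least 3 of the 4 models

# df = pd.read_csv("reddit_posts_labels.csv")  # hypothetical file
# df["binge_eating_ensemble"] = df.apply(lambda r: ensemble_label(r, "binge_eating"), axis=1)
```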
This dataset is a translation of the MS MARCO dataset, making it the first large-scale Urdu IR dataset.
Dataset for testing the ability of Vision-Language Models (VLMs) to recognize and match 3D objects of exactly the same 3D shape but with different orientations, materials, textures, environments, and lighting conditions.
M²ConceptBase is a concept-centric multimodal knowledge base designed to bridge the gap between visual and linguistic semantics. It features 951K images and 152K concepts, with each concept linked to an average of 6.27 images and a detailed textual description.
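A toy example of what a single concept entry might look like is sketched below; the keys are illustrative assumptions, not the released schema.

```python
# A toy M²ConceptBase-style entry; keys and paths are placeholders.
concept_entry = {
    "concept": "red panda",
    "description": "A small arboreal mammal native to the eastern Himalayas ...",
    "images": [
        "images/red_panda_0001.jpg",
        "images/red_panda_0002.jpg",
    ],  # each concept links to about 6.27 images on average in the full knowledge base
}
```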
JustLogic is a natural language deductive reasoning dataset. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy.
Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with TimelineKGQA, a universal temporal QA generator applicable to any TKGs. The code is available at https://github.com/PascalSun/TimelineKGQA as an open-source Python package.
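The small self-contained sketch below illustrates the kind of timeline-based QA generation the package automates, using a toy temporal fact; the types and question template are illustrative, not the package's actual API.

```python
from collections import namedtuple

# A toy temporal fact (subject, relation, object, start, end); not the package's own types.
TemporalFact = namedtuple("TemporalFact", "subject relation obj start end")

fact = TemporalFact("Alice", "worked_for", "Acme Corp", "2015-03", "2019-07")

def simple_qa(f: TemporalFact) -> tuple[str, str]:
    """Generate one timeline-position question from a single fact."""
    question = f"During which period did {f.subject} {f.relation.replace('_', ' ')} {f.obj}?"
    answer = f"{f.start} to {f.end}"
    return question, answer

print(simple_qa(fact))
```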