Datasets

3,148 machine learning datasets

3,148 dataset results

LongForm

LongForm dataset is created by leveraging English corpus examples with augmented instructions. It contains diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for the given documents via LLMs. The examples generated from raw text corpora via LLMs includes structured corpus examples, as well as various NLP task examples such as email writing, grammar error correction, story/poem generation, and text summarization.

8 papers0 benchmarksTexts

WebUI

The WebUI dataset contains 400K web UIs captured over a period of 3 months and cost about $500 to crawl. We grouped web pages together by their domain name, then generated training (70%), validation (10%), and testing (20%) splits. This ensured that similar pages from the same website must appear in the same split. We created four versions of the training dataset. Three of these splits were generated by randomly sampling a subset of the training split: Web-7k, Web-70k, Web-350k. We chose 70k as a baseline size, since it is approximately the size of existing UI datasets. We also generated an additional split (Web-7k-Resampled) to provide a small, higher quality split for experimentation. Web-7k-Resampled was generated using a class-balancing sampling technique, and we removed screens with possible visual defects (e.g., very small, occluded, or invisible elements). The validation and test split was always kept the same.

8 papers0 benchmarksImages, Texts

AmaSum

AmaSum is the largest abstractive opinion summarization dataset, consisting of more than 33,000 human-written summaries for Amazon products. Each summary is paired, on average, with more than 320 customer reviews. Summaries consist of verdicts, pros, and cons, see the example below.

8 papers0 benchmarksTexts

CAD (Contextual Abuse Dataset)

Dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales and (4) uses an expert-driven group-adjudication process for high quality annotations.

8 papers2 benchmarksTexts

SciGraphQA

SciGraphQA is a large-scale, open-domain dataset focused on generating multi-turn conversational question-answering dialogues centered around understanding and describing scientific graphs and figures. It contains over 300,000 samples derived from academic research papers in computer science and machine learning domains.

8 papers0 benchmarksImages, Texts

T$^3$Bench

T$^3$Bench is the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation (300 prompts in total). To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and Large Language Model (LLM) evaluation to measure text-3D consistency.

8 papers3 benchmarks3D, Texts

WANDS (Wayfair ANnotation Dataset)

The dataset contains:

8 papers0 benchmarksTexts

GQA-REX

A GQA-based dataset with 1,040,830 multi-modal explanations of visual reasoning processes.

8 papers24 benchmarksImages, Texts

COCO-MIG (COCO-MIG benchmark)

The COCO-MIG benchmark (Common Objects in Context Multi-Instance Generation) is a benchmark used to evaluate the generation capability of generators on text containing multiple attributes of multi-instance objects. This benchmark consists of 800 sets of examples sampled from the COCO dataset. Following the layout of the COCO dataset, each instance is assigned random color information, and corresponding global image descriptions are constructed according to templates. The COCO-MIG also provides a complete pipeline for resampling and evaluating. For relevant tools and specific details, please refer to our project's homepage.

8 papers8 benchmarksImages, Texts

VideoXum

VideoXum is an enriched large-scale dataset for cross-modal video summarization. The dataset is built on ActivityNet Captions. The datasets includes three subtasks: Video-to-Video Summarization (V2V-SUM), Video-to-Text Summarization (V2T-SUM), and Video-to-Video&Text Summarization (V2VT-SUM).

8 papers0 benchmarksTexts, Videos

Arena-Hard-Auto

The Arena-Hard-Auto benchmark is an automatic evaluation tool for instruction-tuned Language Learning Models (LLMs)¹. It was developed to provide a cheaper and faster approximation to human preference¹.

8 papers0 benchmarksTexts

GTA (A Benchmark for General Tool Agents)

A benchmark to evaluate the tool-use capabilities of LLM-based agents in real-world scenarios.

8 papers0 benchmarksImages, Texts

10 Synthetic Genomics Datasets

These are 10 synthetic genomics datasets generated with NEAT v3 (based on TP53 gene of Homo Sapiens) for the use case of benchmarking somatic variant callers. To find more about our generating framework please visit synth4bench GitHub repository.

8 papers2 benchmarksBiomedical, Texts

RFUND (Revised FUNSD and XFUND)

RFUND is a relabeled version of FUNSD and XFUND datasets, tackling the following issues in their original annotations:

8 papers0 benchmarksImages, Texts

RFUND-EN

English subset of RFUND

8 papers1 benchmarksImages, Texts

Amazon Toys & Games (Amazon Toys & Games 5-core)

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

8 papers3 benchmarksImages, Texts

Spider 2.0

Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various types of databases, such as BigQuery, Snowflake, Postgres, ClickHouse, DuckDB, and SQLite. It is required to engage with complex SQL workflows, process extensive contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines across multiple interactions.

8 papers2 benchmarksEnvironment, Texts

OpenS2V-Eval

OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics: NexusScore, NaturalScore, GmeScore to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 14 representative S2V models, highlighting their strengths and weaknesses across different content.

8 papers32 benchmarksImages, Texts, Videos

WeChat

The WeChat dataset for fake news detection contains more than 20k news labelled as fake news or not.

7 papers2 benchmarksTexts

MedDialog

The MedDialog dataset (Chinese) contains conversations (in Chinese) between doctors and patients. It has 1.1 million dialogues and 4 million utterances. The data is continuously growing and more dialogues will be added. The raw dialogues are from haodf.com. All copyrights of the data belong to haodf.com.

7 papers0 benchmarksMedical, Texts

PreviousPage 52 of 158Next