Datasets

3,148 machine learning datasets


Song Describer Dataset

The Song Describer Dataset (SDD) contains ~1.1k captions for 706 permissively licensed music recordings. It is designed for use in evaluation of models that address music-and-language (M&L) tasks such as music captioning, text-to-music generation and music-language retrieval.

5 papers | 2 benchmarks | Audio, Music, Texts

Test-of-Time (Test of Time Synthetic Video Dataset)

The goal of this dataset is to probe video-language models for understanding of simple temporal relations like "before" and "after". The dataset is only meant to be an evaluation set and not a training set.

5 papers | 2 benchmarks | Texts, Videos

MatSynth

MatSynth is a Physically Based Rendering (PBR) materials dataset designed for modern AI applications. It consists of over 4,000 ultra-high-resolution materials, offering large scale, diversity, and detail.

5 papers | 0 benchmarks | 3D, Images, Texts

SAFIM (Syntax-Aware Fill-In-the-Middle)

Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has three subtasks: Algorithmic Block Completion, Control-Flow Expression Completion, and API Function Call Completion. SAFIM is sourced from code submitted from April 2022 to January 2023 to minimize the impact of data contamination on evaluation results.

5 papers | 4 benchmarks | Texts
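To make the Fill-in-the-Middle setup in the SAFIM entry concrete, here is a minimal sketch of how a single FIM example might be built and checked. It is not SAFIM's own harness: the `FimExample` class, the helper names, and the `<PRE>`/`<SUF>`/`<MID>` sentinel strings are hypothetical placeholders, and real models each define their own sentinel tokens and prompt ordering.

```python
from dataclasses import dataclass


@dataclass
class FimExample:
    prefix: str  # code before the masked span
    middle: str  # ground-truth span the model is asked to produce
    suffix: str  # code after the masked span


def make_fim_example(code: str, start: int, end: int) -> FimExample:
    """Mask code[start:end] and keep the surrounding context."""
    if not 0 <= start <= end <= len(code):
        raise ValueError("span out of range")
    return FimExample(code[:start], code[start:end], code[end:])


def build_prompt(ex: FimExample, pre: str = "<PRE>", suf: str = "<SUF>",
                 mid: str = "<MID>") -> str:
    """Prefix-suffix-middle prompt; the sentinel strings are assumptions."""
    return f"{pre}{ex.prefix}{suf}{ex.suffix}{mid}"


def reassemble(ex: FimExample, completion: str) -> str:
    """Splice a model completion back between prefix and suffix."""
    return ex.prefix + completion + ex.suffix


code = "def add(a, b):\n    return a + b\n"
start = code.index("return")
end = code.index("\n", start)
ex = make_fim_example(code, start, end)  # masks the return statement
prompt = build_prompt(ex)
```

A completion can then be judged by reassembling the program with `reassemble(ex, completion)` and running it or comparing it against `ex.middle`; which criterion a given benchmark uses is outside the scope of this sketch.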

CC3M-TagMask

The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that aptly describe the relationship between the image and the corresponding text. These annotations provide valuable insights into the semantic connection between each pair's visual and textual elements.

5 papers | 17 benchmarks | Images, Texts

ChronoMagic

ChronoMagic contains 2,265 metamorphic time-lapse videos, each accompanied by a detailed caption.

5 papers | 0 benchmarks | Texts, Videos

RTL-Repo

RTL-Repo is a benchmark for evaluating LLMs' effectiveness in generating Verilog code autocompletions within large, complex codebases. It assesses the model's ability to understand and remember the entire Verilog repository context and generate new code that is correct, relevant, logically consistent, and adherent to coding conventions and guidelines, while being aware of all components and modules in the project. This provides a realistic evaluation of a model's performance in real-world RTL design scenarios. RTL-Repo comprises over 4000 code samples from GitHub repositories, each containing the context of all Verilog code in the repository, offering a valuable resource for the hardware design community to assess and train LLMs for Verilog code generation in complex, multi-file RTL projects.

5 papers | 0 benchmarks | Texts

SkyEye-968k

Introduced in the paper "SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model".

5 papers | 0 benchmarks | Images, Texts

SugarCrepe++

The SugarCrepe++ dataset evaluates the sensitivity of vision-language models (VLMs) and unimodal language models (ULMs) to semantic and lexical alterations. Each sample consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. The original SugarCrepe dataset has only one positive and one hard negative caption per image. Relative to the negative caption, a single positive caption can have either low or high lexical overlap, and the original SugarCrepe captures only the high-overlap case. To evaluate the sensitivity of encoded semantics to lexical alteration, we require an additional positive caption with a different lexical composition. SugarCrepe++ fills this gap by adding a second positive caption, which enables a more thorough assessment of the models.

5 papers | 0 benchmarks | Images, Texts
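The caption triplet described in the SugarCrepe++ entry can be illustrated with a small text-only check. The rule below (the two positive captions must be more similar to each other than either is to the hard negative) is one plausible way to score a triplet and is an assumption, not necessarily the benchmark's official metric. The `jaccard` and `triplet_passes` helpers and the example captions are made up for illustration.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity: overlap of lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def triplet_passes(p1: str, p2: str, neg: str,
                   sim: Callable[[str, str], float]) -> bool:
    """True if the positives are closer to each other than to the negative."""
    pos = sim(p1, p2)
    return pos > sim(p1, neg) and pos > sim(p2, neg)


# Hypothetical captions: p1 and p2 mean the same thing, neg swaps the roles.
p1 = "a dog chases a cat"
p2 = "a cat is being chased by a dog"
neg = "a cat chases a dog"

# Pure token overlap rates the negative as closer to p1 than p2 is,
# so a purely lexical similarity fails this triplet.
lexical_result = triplet_passes(p1, p2, neg, jaccard)
```

Swapping `jaccard` for the similarity function of a real text encoder turns the same check into a per-sample pass/fail signal.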

MoA (MoA_Long_ModelQA)

This is the dataset used by the automatic sparse attention compression method MoA. It enhances the calibration dataset by integrating long-range dependencies and model alignment. MoA uses long-context datasets whose question-answer pairs depend heavily on long-range content.

5 papers | 0 benchmarks | Texts

M3GIA


5 papers | 0 benchmarks | Images, Texts

IAM(line-level) (Line-level Handwritten Text Recognition on IAM)

The IAM database contains 13,353 images of handwritten lines of text produced by 657 writers, who transcribed texts from the Lancaster-Oslo/Bergen Corpus of British English. In total it covers 1,539 handwritten pages comprising 115,320 words, and it is categorized as part of the modern collection. The database is labeled at the sentence, line, and word levels.

5 papers | 4 benchmarks | Images, Texts

Loong

We propose a novel long-context benchmark, 🐉 Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Each Loong test instance contains 11 documents on average, spanning three real-world scenarios in English and Chinese: (1) Financial Reports, (2) Legal Cases, and (3) Academic Papers. Loong also introduces new evaluation tasks from the perspectives of Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Furthermore, Loong features inputs of varying lengths (e.g., 10K-50K, 50K-100K, 100K-200K, beyond 200K) and evaluation tasks of diverse difficulty, enabling fine-grained assessment of LLMs across different context lengths and task complexities.

5 papers | 0 benchmarks | Texts
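The length ranges listed in the Loong entry suggest grouping results by input size. The sketch below assigns a token count to one of those ranges. It rests on several assumptions that the description does not settle: that "K" means 1,000 tokens, that lower bounds are inclusive, and that inputs shorter than 10K get their own label. The `length_bucket` helper is illustrative and not part of the benchmark's tooling.

```python
from bisect import bisect_right

# Bucket edges in tokens, assuming "K" means 1,000 (an assumption).
EDGES = [10_000, 50_000, 100_000, 200_000]
LABELS = ["below 10K", "10K-50K", "50K-100K", "100K-200K", "beyond 200K"]


def length_bucket(num_tokens: int) -> str:
    """Map a token count to a length range; lower bounds are inclusive."""
    if num_tokens < 0:
        raise ValueError("token count must be non-negative")
    # bisect_right counts how many edges are <= num_tokens.
    return LABELS[bisect_right(EDGES, num_tokens)]


# Group hypothetical per-instance token counts by bucket.
counts = [12_500, 75_000, 150_000, 310_000]
by_bucket = {n: length_bucket(n) for n in counts}
```

Reporting accuracy separately for each label then gives the per-length breakdown the entry describes.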

Infinity-MM

We collect, organize, and open-source Infinity-MM, a large-scale multimodal instruction dataset consisting of tens of millions of samples. Quality filtering and deduplication are applied to improve the dataset's quality and diversity. We also propose a synthetic data generation method based on open-source models and a labeling system, using detailed image annotations and diverse question generation.

5 papers | 0 benchmarks | Images, Texts, Videos

HourVideo

We introduce HourVideo, a benchmark dataset for hour-long video-language understanding. HourVideo consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. We hope to establish HourVideo as a benchmark challenge to spur the development of advanced multimodal models capable of truly understanding endless streams of visual data.

5 papers | 0 benchmarks | Texts, Videos

TextAtlasEval

A dense-text image benchmark for evaluating the text-generation ability of large generative models.

5 papers | 15 benchmarks | Images, Texts

Open6DOR V2 (Benchmarking Open-instruction 6-DoF Object Rearrangement and A VLM-based Approach)

We introduce a challenging and comprehensive benchmark for open-instruction 6-DoF object rearrangement tasks, termed Open6DOR.

5 papers | 6 benchmarks | Images, Texts

ImplicitQA

The ImplicitQA dataset was introduced in the paper ImplicitQA: Going beyond frames towards Implicit Video Reasoning.

5 papers | 0 benchmarks | Texts, Videos

SCDE

SCDE is a human-created sentence cloze dataset collected from public school English examinations in China. The task requires a model to fill in multiple blanks in a passage, choosing from a shared candidate set that includes distractors designed by English teachers.

4 papers | 3 benchmarks | Texts

DDRel

DDRel is a dataset for interpersonal relation classification in dyadic dialogues. It consists of 6,300 dyadic dialogue sessions between 694 pairs of speakers, with 53,126 utterances in total. It was constructed by crawling movie scripts from IMSDb and annotating each session with a relation label drawn from 13 predefined relationships.

4 papers | 6 benchmarks | Texts
