SSCBench establishes a large-scale SSC benchmark in street views that facilitates the training of robust and generalizable SSC models. SSCBench consists of three subsets: 38,562 frames for training, 15,798 for validation, and 12,553 for testing, for a total of 66,913 frames.
The real captured dataset of LOL contains 500 low-/normal-light image pairs. Most low-light images are collected by varying exposure time and ISO while keeping the other camera settings fixed. We capture images from a variety of scenes, e.g., houses, campuses, clubs, and streets.
This dataset was introduced by the TrackNetV2 work. Following the dataset split defined by the authors, we use all the clips from 26 matches as a training set and the remaining 3 matches as a testing set.
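The split above is defined at the match level rather than the clip level, so no match contributes clips to both sets and near-duplicate frames cannot leak across the split. A minimal sketch of such a match-level split (the `clips` records and match IDs below are hypothetical, not taken from the actual dataset):

```python
def split_by_match(clips, test_matches):
    """Partition clips into train/test by the match they come from,
    so a single match never appears in both sets."""
    train, test = [], []
    for clip in clips:
        (test if clip["match"] in test_matches else train).append(clip)
    return train, test

# Hypothetical clip records; the real dataset uses 26 training and 3 test matches.
clips = [{"match": m, "clip_id": i} for i, m in enumerate([1, 1, 2, 3, 3, 3])]
train, test = split_by_match(clips, test_matches={3})
```

Because the partition key is the match ID, adding new clips from an existing match can never change which side of the split that match falls on.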
This dataset contains benchmark scores for EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://www.eqbench.com.
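The reported r=0.97 is a Pearson correlation between per-model scores on the two benchmarks. As a reminder of what that figure measures, here is a minimal Pearson-correlation sketch over made-up per-model scores (the numbers below are illustrative, not actual EQ-Bench or MMLU results):

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (fabricated) scores for five models on two benchmarks:
eq_scores = [40.2, 55.1, 62.8, 71.3, 78.9]
mmlu_scores = [35.0, 52.4, 60.1, 68.7, 80.2]
r = pearson_r(eq_scores, mmlu_scores)
```

A value of r near 1 means the two benchmarks rank and space models almost identically, which is the basis for the claim that EQ-Bench captures similar aspects of broad capability.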
The EPRSTMT dataset, also known as EPR-sentiment, is a binary sentiment analysis dataset based on product reviews on an e-commerce platform. Each sample in the dataset is labeled as either Positive or Negative. It was collected by the ICIP Lab of Beijing Normal University and has been re-organized to make it suitable for sentiment analysis tasks.
ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical piano music.
To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.
The SARDet-100K dataset encompasses a total of 116,598 images and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. SARDet-100K is the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). Its scale and diversity provide researchers with a robust resource for training and evaluation, fostering the development of state-of-the-art SAR object detection models.
SciEval is a comprehensive and multi-disciplinary evaluation benchmark designed to assess the performance of large language models (LLMs) in the scientific domain. It addresses several critical issues related to evaluating LLMs for scientific research.
The SAP benchmark addresses attack prompt generation for red-teaming and defending large language models (LLMs).
HEIM stands for Holistic Evaluation of Text-To-Image Models. It is a comprehensive benchmark designed to assess the capabilities and risks of text-to-image generation models. Unlike previous evaluations that primarily focused on image-text alignment and image quality, HEIM considers 12 different aspects that are crucial for real-world model deployment.
MedConceptsQA - Open Source Medical Concepts QA Benchmark
GenAI-Bench is a benchmarking framework designed to evaluate and improve compositional text-to-visual generation models. It was developed by researchers from Carnegie Mellon University and Meta.
This dataset encompasses a diverse range of tactile features that are instrumental in distinguishing various material properties. Three downstream tasks are considered: 1) categorization of materials, 2) distinction between hard and soft surfaces, and 3) distinction between smooth and textured surfaces.
Source: ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
NTU4DRadLM is a novel 4D radar dataset proposed specifically for research on robust SLAM, based on 4D radar, a thermal camera, and an IMU. In total, the dataset covers around 17.6 km and 85 minutes of driving and occupies roughly 50 GB.
The Ghostbusters dataset leverages the GPT-3.5-turbo model to generate texts in three domains: creative writing, news, and student essays, providing 2,000 texts in the first two domains and 1,994 in the third.
It contains 15K triplets of essay problem statements, student-written essays, and LLM-generated essays.
It is a collection regrouping all the datasets that constitute the ViDoRe benchmark. It includes the test sets of several academic datasets (ArXiVQA, DocVQA, InfoVQA, TATDQA, TabFQuAD) as well as synthetically generated datasets spanning various themes and industrial applications (Artificial Intelligence, Government Reports, Healthcare Industry, Energy, and Shift Project). Further details can be found on the corresponding dataset cards.