SSCBench establishes a large-scale SSC benchmark in street views that facilitates the training of robust and generalizable SSC models. SSCBench consists of three subsets: 38,562 frames for training, 15,798 for validation, and 12,553 for testing, for a total of 66,913 frames.
The real captured dataset of LOL contains 500 low-/normal-light image pairs. Most low-light images are collected by varying exposure time and ISO while keeping the other camera settings fixed. We capture images from a variety of scenes, e.g., houses, campuses, clubs, and streets.
This dataset was introduced by the TrackNetV2 work. Following the dataset split defined by the authors, we use all the clips from 26 matches as a training set and the remaining 3 matches as a testing set.
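The split above is defined at the match level rather than the clip level, so no match contributes clips to both sets and near-duplicate frames cannot leak across the split. A minimal sketch of such a match-level split (the `clips` records and match IDs below are hypothetical, not taken from the actual dataset):

```python
def split_by_match(clips, test_matches):
    """Partition clips into train/test by the match they come from,
    so a single match never appears in both sets."""
    train, test = [], []
    for clip in clips:
        (test if clip["match"] in test_matches else train).append(clip)
    return train, test

# Hypothetical clip records; the real dataset uses 26 training and 3 test matches.
clips = [{"match": m, "clip_id": i} for i, m in enumerate([1, 1, 2, 3, 3, 3])]
train, test = split_by_match(clips, test_matches={3})
```

Because the partition key is the match ID, adding new clips from an existing match can never change which side of the split that match falls on.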
This dataset contains benchmark scores for EQ-Bench, a novel benchmark designed to evaluate aspects of emotional intelligence in Large Language Models (LLMs). We assess the ability of LLMs to understand complex emotions and social interactions by asking them to predict the intensity of emotional states of characters in a dialogue. The benchmark is able to discriminate effectively between a wide range of models. We find that EQ-Bench correlates strongly with comprehensive multi-domain benchmarks like MMLU (Hendrycks et al., 2020) (r=0.97), indicating that we may be capturing similar aspects of broad intelligence. Our benchmark produces highly repeatable results using a set of 60 English-language questions. We also provide open-source code for an automated benchmarking pipeline at https://github.com/EQ-bench/EQ-Bench and a leaderboard at https://www.eqbench.com.
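The reported r=0.97 is a Pearson correlation between per-model scores on the two benchmarks. As a reminder of what that figure measures, here is a minimal Pearson-correlation sketch over made-up per-model scores (the numbers below are illustrative, not actual EQ-Bench or MMLU results):

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (fabricated) scores for five models on two benchmarks:
eq_scores = [40.2, 55.1, 62.8, 71.3, 78.9]
mmlu_scores = [35.0, 52.4, 60.1, 68.7, 80.2]
r = pearson_r(eq_scores, mmlu_scores)
```

A value of r near 1 means the two benchmarks rank and space models almost identically, which is the basis for the claim that EQ-Bench captures similar aspects of broad capability.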
The EPRSTMT dataset, also known as EPR-sentiment, is a binary sentiment analysis dataset based on product reviews on an e-commerce platform. Each sample in the dataset is labeled as either Positive or Negative. It was collected by the ICIP Lab of Beijing Normal University and has been re-organized to make it suitable for sentiment analysis tasks.
ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical piano music.
To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup.
The SARDet-100K dataset encompasses a total of 116,598 images and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. SARDet-100K is the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). Its scale and diversity provide researchers with a robust resource for training and evaluation, fostering the development of state-of-the-art SAR object detection models.
SciEval is a comprehensive and multi-disciplinary evaluation benchmark designed to assess the performance of large language models (LLMs) in the scientific domain. It addresses several critical issues related to evaluating LLMs for scientific research.
The SAP benchmark addresses attack prompt generation for red-teaming and defending large language models (LLMs).
HEIM stands for Holistic Evaluation of Text-To-Image Models. It is a comprehensive benchmark designed to assess the capabilities and risks of text-to-image generation models. Unlike previous evaluations that primarily focused on image-text alignment and image quality, HEIM considers 12 different aspects that are crucial for real-world model deployment.
MedConceptsQA - Open Source Medical Concepts QA Benchmark
GenAI-Bench is a benchmarking framework designed to evaluate and improve compositional text-to-visual generation models. It was developed by researchers from Carnegie Mellon University and Meta.
This dataset encompasses a diverse range of tactile features that are instrumental in distinguishing various material properties. Three downstream tasks are considered: 1) categorization of materials, 2) distinction between hard and soft surfaces, and 3) distinction between smooth and textured surfaces.
Source: ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
NTU4DRadLM is a novel 4D radar dataset proposed specifically for research on robust SLAM, based on 4D radar, a thermal camera, and an IMU. In total, the dataset covers around 17.6 km and 85 minutes of driving and occupies roughly 50 GB.
The Ghostbusters dataset leverages the GPT-3.5-turbo model to generate texts in three domains: creative writing, news, and student essays, providing 2,000 texts in the first two domains and 1,994 in the third.
It contains 15K triplets of essay problem statements, student-written essays, and LLM-generated essays.
It is a collection regrouping all the datasets that constitute the ViDoRe benchmark. It includes the test sets of several academic datasets (ArXiVQA, DocVQA, InfoVQA, TATDQA, TabFQuAD) as well as synthetically generated datasets spanning various themes and industrial applications (Artificial Intelligence, Government Reports, Healthcare Industry, Energy, and Shift Project). Further details can be found on the corresponding dataset cards.