Datasets

19,997 machine learning datasets


Electronics

This data was collected by performing a breadth-first search on the user-product-review graph until termination, meaning that it is a fairly comprehensive collection of English-language product data. We split the full dataset into top-level categories, e.g. Books, Movies, Music. We do this mainly for practical reasons, as it allows each model and dataset to fit in memory on a single machine (requiring around 64GB RAM and 2-3 days to run our largest experiment). Note that splitting the data in this way has little impact on performance, as there are few links that cross top-level categories, and the hierarchical nature of our model means that few parameters are shared across categories.

2 papers · 1 benchmark · Graphs

ALM-Bench (All Languages Matter Benchmark)


2 papers · 0 benchmarks · Images, Texts

GEOBench-VLM

GEOBench-VLM is a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. The benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object type, and scale.

2 papers · 0 benchmarks · Images, Texts

Blue Cells enumeration dataset (Learning to Count Objects in Images blue cells dataset)

This dataset was created for the paper "Learning to Count Objects in Images" by Victor Lempitsky and Andrew Zisserman as a benchmark for cell enumeration. It comes with 200 images of artificial blue fluorescent cells at 256x256 resolution, and the ground truth consists of the spatial coordinates of each cell's centroid. It is useful for regression counting or density map estimation.
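As a rough illustration of how centroid annotations can feed density map estimation, the sketch below places one normalized Gaussian per centroid so that the map sums to the cell count. It is a minimal sketch, not the method of the original paper: the function name `density_map`, the `(x, y)` centroid order, and the kernel width `sigma` are all assumptions made for illustration.

```python
import math


def density_map(centroids, height=256, width=256, sigma=2.0):
    """Build a density map from (x, y) centroids (hypothetical helper).

    Each centroid contributes a Gaussian truncated at 3*sigma and
    renormalized over in-bounds pixels, so it adds exactly 1 to the sum.
    """
    radius = int(math.ceil(3 * sigma))
    dmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centroids:
        px, py = int(round(cx)), int(round(cy))
        weights = []
        # Collect kernel weights for pixels that fall inside the image.
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                weights.append((y, x, math.exp(-d2 / (2 * sigma ** 2))))
        total = sum(w for _, _, w in weights)
        if total == 0.0:
            continue  # centroid lies too far outside the image to contribute
        for y, x, w in weights:
            dmap[y][x] += w / total
    return dmap


if __name__ == "__main__":
    # Made-up centroids, only to show the call; not taken from the dataset.
    example = [(10.5, 20.25), (100.0, 200.0)]
    dm = density_map(example)
    print(sum(sum(row) for row in dm))  # approximately the number of cells
```

Summing the map over any region gives an estimated count for that region, which is why a regressor trained to predict such maps can be used for counting.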

2 papers · 0 benchmarks · Images

SC_Complexity_DS

Dataset of complexity metrics paired with vulnerable smart contracts. We used the dataset provided by Liu et al., specifically "Resource 3" on their GitHub, which contains eight types of vulnerabilities, and paired it with complexity metrics. The vulnerability types they measured can be found in our paper. This dataset, released in 2023, includes 2,953 smart contracts and a total of 16,239 contracts, of which 258 are vulnerable and 15,981 are neutral.

2 papers · 0 benchmarks

AI-ArtBench

AI-ArtBench is an AI-generated artistic dataset that contains 180,000+ art images. Of these, 60,000 are human-drawn artworks taken directly from the ArtBench-10 dataset; the rest are generated in equal parts with Latent Diffusion and Standard Diffusion models. The human-drawn art is at 256x256 resolution, while images generated with Latent Diffusion and Standard Diffusion are at 256x256 and 768x768 resolution, respectively.

2 papers · 0 benchmarks

Underwater Trash Detection

The Underwater Trash Detection Dataset is a custom-annotated dataset designed to address the challenges of underwater trash detection caused by varying environmental features. Publicly available datasets alone are insufficient for training deep learning models because of domain-specific variations in underwater conditions. This dataset offers a cumulative, self-annotated collection of underwater images for detecting and classifying trash, providing a foundation for deep learning research and benchmark testing.

2 papers · 0 benchmarks · Images, LiDAR, Texts

SEED-VIG (SJTU Emotion EEG Dataset)

The SEED-VIG dataset is composed of four parts. EEG features include: EEG_Feature_2Hz: EEG features (power spectral density: PSD, differential entropy: DE) from the total frequency band (1~50 Hz) with a 2 Hz frequency resolution. The fields "psd_movingAve", "psd_LDS", "de_movingAve", and "de_LDS" indicate PSD with a moving average, PSD with a linear dynamic system, DE with a moving average, and DE with a linear dynamic system, respectively. The data format is channel × sample_number × frequency_bands (17 × 885 × 25). The first 1–5 in the first dimension 'channel' correspond to temporal brain areas, and the last 7–17 correspond to posterior brain areas. EEG_Feature_5Bands: This part is similar to the EEG_Feature_2Hz file except that EEG features (PSD, DE) are extracted from five frequency bands: delta (1~4 Hz), theta (4~8 Hz), alpha (8~14 Hz), beta (14~31 Hz), and gamma (31~50 Hz). The data format is channel × sample_number × frequency_bands (17 × 885 × 5). Forehead EEG feature files have a similar architectur

2 papers · 0 benchmarks · EEG

ULP Dataset

Hundreds of clean and poisoned models per dataset for Tiny-ImageNet, CIFAR10

2 papers · 0 benchmarks

CylinderFlow (CylinderFlow with dynamic grid)

This dataset captures incompressible fluid dynamics around a 2D circular cylinder within a channel:
\begin{align}
\nabla \cdot \mathbf{u} &= 0, \\
\partial_t \mathbf{u} + (\mathbf{u} \cdot \nabla) \mathbf{u} &= \nu \nabla^2 \mathbf{u} - \frac{1}{\rho} \nabla p,
\end{align}
with boundary conditions set for velocity and pressure. It features 100 snapshots per case, with 7600 training, 1000 validation, and 1000 test samples.

2 papers · 0 benchmarks

Poisson Equation (Poisson Equation with unstructured grid)

The Poisson equation with Dirichlet boundary conditions is studied:
\begin{align}
-\Delta u &= f, \quad \text{in } \Omega = [0,1]^2, \\
u &= 0, \quad \text{on } \partial \Omega,
\end{align}
where $f$ consists of a Gaussian superposition, with parameters $\mu_{x,i}, \mu_{y,i} \sim \text{U}(0,1)$ and $\sigma_i \sim \text{U}(0.025, 0.1)$. The dataset includes 4000 training, 500 validation, and 500 test samples.
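To make the source term more concrete, the sketch below samples one Gaussian superposition using the parameter ranges stated above. Several details are assumptions, since the description does not give them: the number of Gaussians (`num_gaussians`), unit amplitudes, an isotropic Gaussian form, and evaluation on a uniform grid (the dataset itself uses an unstructured grid). The helper names are hypothetical.

```python
import math
import random


def sample_source(num_gaussians=3, rng=None):
    """Sample parameters of a Gaussian superposition f on [0, 1]^2.

    Centers are drawn from U(0, 1) and widths from U(0.025, 0.1), as in
    the description. The count and unit amplitudes are assumptions.
    """
    rng = rng or random.Random()
    params = [
        (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0), rng.uniform(0.025, 0.1))
        for _ in range(num_gaussians)
    ]

    def f(x, y):
        # Sum of isotropic Gaussians centered at (mu_x, mu_y) with width sigma.
        return sum(
            math.exp(-((x - mx) ** 2 + (y - my) ** 2) / (2.0 * s ** 2))
            for mx, my, s in params
        )

    return params, f


def evaluate_on_grid(f, n=32):
    """Evaluate f on an n x n uniform grid over [0, 1]^2 (an assumption)."""
    h = 1.0 / (n - 1)
    return [[f(i * h, j * h) for i in range(n)] for j in range(n)]


if __name__ == "__main__":
    params, f = sample_source(rng=random.Random(0))
    grid = evaluate_on_grid(f, n=16)
    print(len(grid), len(grid[0]))
```

Each sampled $f$ would then be paired with the solution $u$ of the boundary value problem from a numerical solver, which this sketch does not include.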

2 papers · 0 benchmarks

V2VBench

V2VBench is a comprehensive benchmark designed to evaluate video editing methods. It consists of:
- 50 standardized videos across 5 categories,
- 3 editing prompts per video, encompassing 4 editing tasks, and
- 8 evaluation metrics to assess the quality of edited videos.

2 papers · 0 benchmarks · Images, Texts, Videos

Pythia/Herwig + Delphes Jet Datasets for OmniFold Unfolding

Datasets of QCD jets used for studying unfolding in OmniFold: A Method to Simultaneously Unfold All Observables. Four different datasets are present:

2 papers · 0 benchmarks · Physics

Text2CAD

Prototyping complex computer-aided design (CAD) models in modern software can be very time-consuming. This is due to the lack of intelligent systems that can quickly generate simpler intermediate parts. We propose Text2CAD, the first AI framework for generating text-to-parametric CAD models using designer-friendly instructions for all skill levels. Furthermore, we introduce a data annotation pipeline for generating text prompts based on natural language instructions for the DeepCAD dataset using Mistral and LLaVA-NeXT. The dataset contains $\sim170$K models and $\sim660$K text annotations, from abstract CAD descriptions (e.g., \textit{generate two concentric cylinders}) to detailed specifications (e.g., \textit{draw two circles with center} $(x,y)$ \textit{and radius} $r_{1}$, $r_{2}$, \textit{and extrude along the normal by} $d$...). Within the Text2CAD framework, we propose an end-to-end transformer-based auto-regressive network to generate parametric CAD models from input texts. We

2 papers · 0 benchmarks

TOMG-Bench (Text-based Open Molecule Generation Benchmark)

In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task further contains three subtasks, with each subtask comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction tuning dataset proposed for solving challenges raised by TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-tu

2 papers · 1 benchmark · Graphs, Texts

MGTAcademic

This repository provides a cleaned dataset intended for text classification, language modeling, and AI-generated content detection tasks. The dataset covers fields such as STEM, Social Sciences, and Humanities, and contains data from different categories, each of which has been processed and cleaned for easy use. See our codebase on GitHub for more information.

2 papers · 0 benchmarks

ChronoMagic-Pro


2 papers · 0 benchmarks · Texts, Videos

IL-Datasets (Imitation Datasets)

The imitation learning field requires expert data to train agents in a task. Most often, this learning approach suffers from the absence of available data, which results in each technique being tested on its own dataset. Creating datasets is a cumbersome process requiring researchers to train expert agents from scratch, record their interactions, and test each benchmark method with newly created data. Moreover, creating new datasets for each new technique results in a lack of consistency in the evaluation process, since each dataset can vary drastically in state and action distribution. In response, this work aims to address these issues by creating Imitation Learning Datasets, a toolkit that allows for: (i) curated expert policies with multithreaded support for faster dataset creation; (ii) readily available datasets and techniques with precise measurements; and (iii) sharing implementations of common imitation learning techniques.

2 papers · 0 benchmarks

CypherBench


2 papers · 0 benchmarks

SegRCDB

Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what matters in semantic segmentation pre-training has not been fully investigated. In this paper, we propose the Segmentation Radial Contour DataBase (SegRCDB), which for the first time applies formula-driven supervised learning for semantic segmentation. SegRCDB enables pre-training for semantic segmentation without real images or any manual semantic labels. SegRCDB is based on insights about what is important in pre-training for semantic segmentation and allows efficient pre-training. Pre-training with SegRCDB achieved higher mIoU than the pre-training with COCO-Stuff for fine-tuning on ADE-20k and Cityscapes with the same number of training imag

2 papers · 0 benchmarks
