JetNet is a particle cloud dataset containing gluon, top quark, and light quark jets, saved in .csv format.
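Since the jets are distributed as .csv files, a minimal loading sketch with pandas is shown below; the file name and column layout are assumptions, not the actual JetNet schema.

```python
import pandas as pd

# Placeholder file name and columns; consult the JetNet release for the
# actual .csv schema.
df = pd.read_csv("jetnet_gluon.csv")

# Inspect the per-particle features available for each jet.
print(df.columns.tolist())
print(df.head())
```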
JetClass is a new large-scale dataset to facilitate deep learning research in particle physics. It consists of 100M particle jets for training, 5M for validation, and 20M for testing. The dataset contains 10 classes of jets, simulated with MadGraph + Pythia + Delphes. A detailed description of the JetClass dataset is presented in the paper Particle Transformer for Jet Tagging, and the authors provide an interface for using the dataset.
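As a rough sketch of reading the files directly, assuming the ROOT distribution of JetClass: the file, tree, and branch names below are placeholders, so consult the official interface for the real schema.

```python
import uproot

# Placeholder file, tree, and branch names; not the verified JetClass schema.
# The official interface handles the real layout.
with uproot.open("JetClass_sample.root") as f:
    tree = f["tree"]
    arrays = tree.arrays(["part_px", "part_py", "part_pz"], library="np")
    print({name: arr.shape for name, arr in arrays.items()})
```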
Dataset of high-pT jets from simulations of LHC proton-proton collisions
Dataset of 50,000 top quark-antiquark (ttbar) events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to an average pileup of 200. The dataset consists of detector hits as the input, generator particles as the ground truth, and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged-particle tracker, an electromagnetic calorimeter, and a hadron calorimeter. Pythia8 and Delphes3 were used for the simulation.
CMD is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data have been generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.
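A minimal loading sketch, assuming the maps are shipped as numpy arrays; the file name below is a placeholder modeled on the CAMELS naming conventions, and the shape is an assumption.

```python
import numpy as np

# Placeholder file name modeled on the CAMELS conventions; check the CMD
# documentation for the actual release layout.
maps = np.load("Maps_Mgas_IllustrisTNG_LH_z=0.00.npy")
print(maps.shape)  # e.g. (n_maps, 256, 256) for the 2D maps (assumed)
```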
The dataset consists of many runs of the same quantum circuit on different IBM quantum machines. We used 9 different machines, and for each of them we ran 2000 executions of the circuit. The circuit has 9 different measurement steps along it. To obtain the 9 outcome distributions, for each execution parts of the circuit are appended 9 times (in the same call to the IBM API, and thus in the shortest possible time), measuring a new step each time. The calls to the IBM API followed two different strategies. One was adopted to maximize the number of calls to the interface, parallelizing the code with as many runs as possible and even running 8000 shots per run, taking 8 blocks of 1000 shots from the returned memory to obtain the probabilities. The other strategy was slower, without parallelization and with a minimum waiting time between subsequent executions. The latter was adopted to obtain executions more uniformly distributed in time.
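A minimal sketch of the execution scheme, using a local Aer simulator as a stand-in for the IBM backends; the gates inside each step are placeholders, not the circuit the dataset was generated from.

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

# Local simulator as a stand-in for the IBM backends.
backend = AerSimulator()

def circuit_up_to_step(k, n_qubits=3):
    """Placeholder circuit truncated after measurement step k."""
    qc = QuantumCircuit(n_qubits)
    for step in range(k + 1):
        qc.h(step % n_qubits)
        qc.cx(step % n_qubits, (step + 1) % n_qubits)
    qc.measure_all()
    return qc

# One circuit per measurement step, submitted in a single job, mirroring how
# the 9 outcome distributions were collected per execution.
circuits = [transpile(circuit_up_to_step(k), backend) for k in range(9)]
job = backend.run(circuits, shots=8000, memory=True)
memory = job.result().get_memory(0)  # 8000 recorded shots for step 0

# Fast strategy: split the memory into 8 blocks of 1000 shots, each giving an
# independent estimate of the outcome probabilities.
blocks = [memory[i * 1000:(i + 1) * 1000] for i in range(8)]
```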
Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.
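Assuming the samples are shipped as HDF5 files, a first look could be the sketch below; the file and key names are hypothetical, so see the ClimART repository for the real layout.

```python
import h5py

# Hypothetical file name; the actual ClimART release defines the canonical
# input/target layout and train/validation/test splits.
with h5py.File("climart_train.h5", "r") as f:
    print(list(f.keys()))  # inspect the stored groups before loading slices
```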
PDEBench provides a diverse and comprehensive set of benchmarks for scientific machine learning, including challenging and realistic physical problems. The repository contains the code used to generate the datasets, to upload and download them from the data repository, and to train and evaluate different machine learning models as baselines. PDEBench features a much wider range of PDEs than existing benchmarks and includes realistic and difficult problems (both forward and inverse), as well as larger ready-to-use datasets comprising various initial and boundary conditions and PDE parameters. Moreover, PDEBench was created with extensible source code, and we invite active participation to improve and extend the benchmark.
DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.
Contains data from parametric PDEs
This dataset is the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms.
This dataset contains the data for the paper 'Using Multiple Instance Learning for Explainable Solar Flare Prediction'.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API that makes it easy for practitioners to work with all data in the suite once a general pipeline has been established. This is a dataset accompanying the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.
Multirotor gym environment for learning control policies for various unmanned aerial vehicles.
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
We present a structured benchmark dataset for a representative vibroacoustic problem: predicting the frequency response of vibrating plates. The vibrating-plates benchmark dataset consists of 12,000 varied plate designs in total, together with the accompanying vibration patterns when the plates are excited by a harmonic force. These vibration patterns give the vibration velocity orthogonal to the surface at every location of the plate. The plate designs incorporate randomly placed beadings, i.e. indentations in the plate surface. The beadings stiffen the plates and completely change the resulting vibration patterns. Additionally, the size, thickness, and damping loss factor of the plates are varied.
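A sketch of turning one plate's velocity field into a frequency response curve; the file name and array layout here are assumptions, not the published format, so treat it purely as an indexing illustration.

```python
import numpy as np

# Hypothetical file name and array layout (frequency bins x spatial grid);
# the actual dataset format may differ.
velocity = np.load("plate_0001_velocity.npy")    # assumed shape: (n_freqs, ny, nx)
response = np.abs(velocity).mean(axis=(1, 2))    # mean |velocity| per frequency bin
peak_bin = int(np.argmax(response))              # resonance-like peak in the response
print(peak_bin, response[peak_bin])
```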
The dataset is a .h5 file composed of entries with keys of the form (n,m), denoting the dimensions of the system matrix on which the simulations have been performed. The value of each key is a pair of arrays: one storing the number of iterations needed to terminate the process for each simulation, and the other the number of elements present in the terminal state of each simulation. Thus, given the maximum number of elements n*m in each system, the estimated percolation threshold can be computed by averaging the ratios between the number of elements at each terminal state and the system size. Overall, 207,950,010 simulations have been performed. This dataset was used to perform the complexity analysis in https://arxiv.org/abs/2410.11874
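A sketch of the threshold computation described above, with hypothetical file and key naming; the per-key storage layout is assumed to be two stacked arrays.

```python
import h5py
import numpy as np

# File name, key format, and per-key layout are assumptions; each (n, m) entry
# is described as holding an iterations array and a terminal-elements array.
n, m = 10, 10
with h5py.File("percolation_simulations.h5", "r") as f:
    iterations, elements = f[f"({n},{m})"][:]  # assumed: two stacked arrays
    # Estimated percolation threshold: average fraction of occupied elements
    # at the terminal state over all simulations of this system size.
    threshold = np.mean(np.asarray(elements) / (n * m))
print(f"estimated threshold for {n}x{m}: {threshold:.4f}")
```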
Datasets of QCD jets used for studying unfolding in OmniFold: A Method to Simultaneously Unfold All Observables. Four different datasets are present:
pd4ml is a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray-induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics.
Neural network model files and MadGraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences" (arXiv:2108.11481; 2022 Mach. Learn.: Sci. Technol. 3 015021). Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models