JetNet is a particle cloud dataset containing gluon, top quark, and light quark jets, saved in .csv format.
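Since the jets are distributed as .csv files, a minimal loading sketch with pandas is shown below; the file name and column layout are assumptions, not the actual JetNet schema.

```python
import pandas as pd

# Placeholder file name and columns; consult the JetNet release for the
# actual .csv schema.
df = pd.read_csv("jetnet_gluon.csv")

# Inspect the per-particle features available for each jet.
print(df.columns.tolist())
print(df.head())
```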
JetClass is a new large-scale dataset to facilitate deep learning research in particle physics. It consists of 100M particle jets for training, 5M for validation, and 20M for testing. The dataset contains 10 classes of jets, simulated with MadGraph + Pythia + Delphes. A detailed description of the JetClass dataset is presented in the paper Particle Transformer for Jet Tagging, and the authors provide an interface for using the dataset.
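As a rough sketch of reading the files directly, assuming the ROOT distribution of JetClass: the file, tree, and branch names below are placeholders, so consult the official interface for the real schema.

```python
import uproot

# Placeholder file, tree, and branch names; not the verified JetClass schema.
# The official interface handles the real layout.
with uproot.open("JetClass_sample.root") as f:
    tree = f["tree"]
    arrays = tree.arrays(["part_px", "part_py", "part_pz"], library="np")
    print({name: arr.shape for name, arr in arrays.items()})
```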
Dataset of high-pT jets from simulations of LHC proton-proton collisions
Dataset of 50,000 top quark-antiquark (ttbar) events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to an average pileup of 200. The dataset consists of detector hits as the input, generator particles as the ground truth, and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged-particle tracker, an electromagnetic calorimeter, and a hadron calorimeter. Pythia8 and Delphes3 were used for the simulation.
CMD is a publicly available collection of hundreds of thousands of 2D maps and 3D grids containing different properties of the gas, dark matter, and stars from more than 2,000 different universes. The data have been generated from thousands of state-of-the-art (magneto-)hydrodynamic and gravity-only N-body simulations from the CAMELS project.
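A minimal loading sketch, assuming the maps are shipped as numpy arrays; the file name below is a placeholder modeled on the CAMELS naming conventions, and the shape is an assumption.

```python
import numpy as np

# Placeholder file name modeled on the CAMELS conventions; check the CMD
# documentation for the actual release layout.
maps = np.load("Maps_Mgas_IllustrisTNG_LH_z=0.00.npy")
print(maps.shape)  # e.g. (n_maps, 256, 256) for the 2D maps (assumed)
```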
The dataset consists of many runs of the same quantum circuit on different IBM quantum machines. We used 9 different machines, and for each of them we ran 2000 executions of the circuit. The circuit has 9 different measurement steps along it. To obtain the 9 outcome distributions, for each execution parts of the circuit are appended 9 times (in the same call to the IBM API, and thus in the shortest possible time), measuring a new step each time. The calls to the IBM API followed two different strategies. One was adopted to maximize the number of calls to the interface, parallelizing the code with as many runs as possible and even running 8000 shots per run, taking 8 blocks of 1000 shots from the returned memory to obtain the probabilities. The other strategy was slower, without parallelization and with a minimum waiting time between subsequent executions. The latter was adopted to obtain executions more uniformly distributed in time.
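A minimal sketch of the execution scheme, using a local Aer simulator as a stand-in for the IBM backends; the gates inside each step are placeholders, not the circuit the dataset was generated from.

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

# Local simulator as a stand-in for the IBM backends.
backend = AerSimulator()

def circuit_up_to_step(k, n_qubits=3):
    """Placeholder circuit truncated after measurement step k."""
    qc = QuantumCircuit(n_qubits)
    for step in range(k + 1):
        qc.h(step % n_qubits)
        qc.cx(step % n_qubits, (step + 1) % n_qubits)
    qc.measure_all()
    return qc

# One circuit per measurement step, submitted in a single job, mirroring how
# the 9 outcome distributions were collected per execution.
circuits = [transpile(circuit_up_to_step(k), backend) for k in range(9)]
job = backend.run(circuits, shots=8000, memory=True)
memory = job.result().get_memory(0)  # 8000 recorded shots for step 0

# Fast strategy: split the memory into 8 blocks of 1000 shots, each giving an
# independent estimate of the outcome probabilities.
blocks = [memory[i * 1000:(i + 1) * 1000] for i in range(8)]
```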
Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed.
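Assuming the samples are shipped as HDF5 files, a first look could be the sketch below; the file and key names are hypothetical, so see the ClimART repository for the real layout.

```python
import h5py

# Hypothetical file name; the actual ClimART release defines the canonical
# input/target layout and train/validation/test splits.
with h5py.File("climart_train.h5", "r") as f:
    print(list(f.keys()))  # inspect the stored groups before loading slices
```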
PDEBench provides a diverse and comprehensive set of benchmarks for scientific machine learning, including challenging and realistic physical problems. The repository contains the code used to generate the datasets, to upload and download them from the data repository, and to train and evaluate different machine learning models as baselines. PDEBench features a much wider range of PDEs than existing benchmarks and includes realistic and difficult problems (both forward and inverse), as well as larger ready-to-use datasets comprising various initial and boundary conditions and PDE parameters. Moreover, PDEBench was created with extensible source code, and we invite active participation to improve and extend the benchmark.
DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.
Contains data from parametric PDEs
This dataset is the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders. The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms.
This dataset contains the data for the paper 'Using Multiple Instance Learning for Explainable Solar Flare Prediction'.
RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API that makes it easy for practitioners to work with all data in the suite once a general pipeline has been established. This is a dataset accompanying the paper RL Unplugged: Benchmarks for Offline Reinforcement Learning.
Multirotor gym environment for learning control policies for various unmanned aerial vehicles.
SuperCaustics is a simulation tool made in Unreal Engine for generating massive computer vision datasets that include transparent objects.
We present a structured benchmark dataset for a representative vibroacoustic problem: predicting the frequency response of vibrating plates. The vibrating-plates benchmark dataset consists of 12,000 varied plate designs in total, together with the accompanying vibration patterns when the plates are excited by a harmonic force. These vibration patterns give the vibration velocity orthogonal to the surface at every location of the plate. The plate designs incorporate randomly placed beadings, i.e. indentations in the plate surface. The beadings stiffen the plates and completely change the resulting vibration patterns. Additionally, the size, thickness, and damping loss factor of the plates are varied.
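A sketch of turning one plate's velocity field into a frequency response curve; the file name and array layout here are assumptions, not the published format, so treat it purely as an indexing illustration.

```python
import numpy as np

# Hypothetical file name and array layout (frequency bins x spatial grid);
# the actual dataset format may differ.
velocity = np.load("plate_0001_velocity.npy")    # assumed shape: (n_freqs, ny, nx)
response = np.abs(velocity).mean(axis=(1, 2))    # mean |velocity| per frequency bin
peak_bin = int(np.argmax(response))              # resonance-like peak in the response
print(peak_bin, response[peak_bin])
```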
The dataset is a .h5 file composed of entries with keys of the form (n,m), denoting the dimensions of the system matrix on which the simulations have been performed. The value of each key is a pair of arrays: one storing the number of iterations needed to terminate the process for each simulation, and the other the number of elements present in the terminal state of each simulation. Thus, given the maximum number of elements n*m in each system, the estimated percolation threshold can be computed by averaging the ratios between the number of elements at each terminal state and the system size. Overall, 207,950,010 simulations have been performed. This dataset was used to perform the complexity analysis in https://arxiv.org/abs/2410.11874
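A sketch of the threshold computation described above, with hypothetical file and key naming; the per-key storage layout is assumed to be two stacked arrays.

```python
import h5py
import numpy as np

# File name, key format, and per-key layout are assumptions; each (n, m) entry
# is described as holding an iterations array and a terminal-elements array.
n, m = 10, 10
with h5py.File("percolation_simulations.h5", "r") as f:
    iterations, elements = f[f"({n},{m})"][:]  # assumed: two stacked arrays
    # Estimated percolation threshold: average fraction of occupied elements
    # at the terminal state over all simulations of this system size.
    threshold = np.mean(np.asarray(elements) / (n * m))
print(f"estimated threshold for {n}x{m}: {threshold:.4f}")
```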
Datasets of QCD jets used for studying unfolding in OmniFold: A Method to Simultaneously Unfold All Observables. Four different datasets are present:
pd4ml is a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray-induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics.
Neural network model files and MadGraph event generator outputs used as inputs to the results presented in the paper "Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences" (arXiv:2108.11481; 2022 Mach. Learn.: Sci. Technol. 3 015021). Code and model files can be found at: https://github.com/darrendavidprice/science-discovery/tree/master/expressive_gaussian_mixture_models