MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J. J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal

2024-10-30

Tasks: De novo molecule generation from MS/MS spectrum; MS/MS spectrum simulation; De novo molecule generation from MS/MS spectrum (bonus chemical formulae); Molecule retrieval from MS/MS spectrum; Molecule retrieval from MS/MS spectrum (bonus chemical formulae); MS/MS spectrum simulation (bonus chemical formulae)

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 28.59 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.07 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 25.72 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.1 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 33.28 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.1 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 21.84 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.15 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 53.8 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.07 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 21.97 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.17 | SMILES Transformer |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 14.64 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 59.15 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 34.87 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 15.37 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 5.24 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 28.21 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 12.58 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 22.13 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 2.54 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 20 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 7.59 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 24.66 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 1.47 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 19.23 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 6.21 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 25.11 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 0.37 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 8.22 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 2.01 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 30.81 | Random |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.52 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 46.64 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 83.58 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 72.56 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.47 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.25 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 8.44 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 38.57 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 21.43 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.24 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.19 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 3.95 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 26.27 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 11.92 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.2 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.15 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 0.38 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 7.17 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 1.72 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.15 | Precursor m/z |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 21.11 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.08 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 18.25 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.11 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 38.88 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.08 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 26.87 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.13 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 79.39 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.03 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 52.13 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.1 | SMILES Transformer |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 31.93 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 82.7 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 63.2 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 7.62 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 44.12 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 22.7 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 3.63 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 33.77 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 13.55 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 2.09 | Precursor m/z |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 22.65 | Precursor m/z |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 8.52 | Precursor m/z |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 9.57 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 41.12 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 22.11 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 12.75 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 6.56 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 33.46 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 16.46 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 14.14 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 5.09 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 31.97 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 14.69 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 14.94 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 4.42 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 30.76 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 14.46 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 15.04 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 3.06 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 27.74 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 11.35 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 13.87 | Random |
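
For readers unfamiliar with the metric names in the table, the following is a minimal sketch of how three of them are commonly computed: hit rate @ k, Tanimoto similarity, and cosine similarity. It is not taken from the MassSpecGym code base. The function names, the input representations (ranked candidate lists, sets of fingerprint on-bit indices, and spectra as bin-to-intensity dictionaries), and the choice to report hit rate as a percentage are all assumptions made for illustration; the benchmark's own implementation may bin spectra, build fingerprints, and handle edge cases differently.

```python
import math


def hit_rate_at_k(ranked_candidates, true_ids, k):
    """Percentage of queries whose true molecule appears in the top-k candidates.

    ranked_candidates: one best-first candidate list per query (hypothetical format).
    true_ids: the correct identifier for each query, in the same order.
    """
    hits = sum(
        1 for ranked, truth in zip(ranked_candidates, true_ids) if truth in ranked[:k]
    )
    return 100.0 * hits / len(true_ids)


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / len(union) if union else 0.0


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z bin: intensity} dicts."""
    dot = sum(v * spec_b.get(mz_bin, 0.0) for mz_bin, v in spec_a.items())
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data: four queries, three ranked candidates each (identifiers are made up).
ranked = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"], ["J", "K", "L"]]
truth = ["A", "F", "X", "K"]

print(hit_rate_at_k(ranked, truth, 1))  # 25.0
print(hit_rate_at_k(ranked, truth, 2))  # 50.0
print(tanimoto({1, 2, 3}, {2, 3, 4}))   # 0.5
print(round(cosine_similarity({100: 1.0, 200: 1.0}, {100: 1.0}), 4))  # 0.7071
```

Under these definitions, higher hit rate, Tanimoto, and cosine values indicate better predictions. MCES is an edit distance between molecular graphs, so lower values are better; it is omitted from the sketch because computing it requires a graph-matching solver.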

Related Papers

DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra (2025-02-13)
MADGEN: Mass-Spec attends to De Novo Molecular generation (2025-01-03)
JESTR: Joint Embedding Space Technique for Ranking Candidate Molecules for the Annotation of Untargeted Metabolomics Data (2024-11-18)