Roman Bushuiev, Anton Bushuiev, Niek F. de Jonge, Adamo Young, Fleming Kretschmer, Raman Samusevich, Janne Heirman, Fei Wang, Luke Zhang, Kai Dührkop, Marcus Ludwig, Nils A. Haupt, Apurva Kalia, Corinna Brungs, Robin Schmid, Russell Greiner, Bo Wang, David S. Wishart, Li-Ping Liu, Juho Rousu, Wout Bittremieux, Hannes Rost, Tytus D. Mak, Soha Hassoun, Florian Huber, Justin J. J. van der Hooft, Michael A. Stravs, Sebastian Böcker, Josef Sivic, Tomáš Pluskal
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 28.59 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.07 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 25.72 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.1 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 33.28 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.1 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 21.84 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.15 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 MCES | 53.8 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-1 Tanimoto | 0.07 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 MCES | 21.97 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum | MassSpecGym | Top-10 Tanimoto | 0.17 | SMILES Transformer |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 14.64 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 59.15 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 34.87 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 15.37 | MIST |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 5.24 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 28.21 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 12.58 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 22.13 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 2.54 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 20 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 7.59 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 24.66 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 1.47 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 19.23 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 6.21 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 25.11 | DeepSets |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 1 | 0.37 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 20 | 8.22 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | Hit rate @ 5 | 2.01 | Random |
| Molecule retrieval from MS/MS spectrum | MassSpecGym | MCES @ 1 | 30.81 | Random |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.52 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 46.64 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 83.58 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 72.56 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.47 | FraGNNet |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.25 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 8.44 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 38.57 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 21.43 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.24 | FFN Fingerprint |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.19 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 3.95 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 26.27 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 11.92 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.2 | GNN |
| MS/MS spectrum simulation | MassSpecGym | Cosine Similarity | 0.15 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 1 | 0.38 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 20 | 7.17 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Hit Rate @ 5 | 1.72 | Precursor m/z |
| MS/MS spectrum simulation | MassSpecGym | Jensen-Shannon Similarity | 0.15 | Precursor m/z |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 21.11 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.08 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 18.25 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.11 | Random chemical generation |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 38.88 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.08 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 26.87 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.13 | SELFIES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 MCES | 79.39 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-1 Tanimoto | 0.03 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 MCES | 52.13 | SMILES Transformer |
| De novo molecule generation from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Top-10 Tanimoto | 0.1 | SMILES Transformer |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 31.93 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 82.7 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 63.2 | FraGNNet |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 7.62 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 44.12 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 22.7 | FFN Fingerprint |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 3.63 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 33.77 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 13.55 | GNN |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 1 | 2.09 | Precursor m/z |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 20 | 22.65 | Precursor m/z |
| MS/MS spectrum simulation (bonus chemical formulae) | MassSpecGym | Hit Rate @ 5 | 8.52 | Precursor m/z |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 9.57 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 41.12 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 22.11 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 12.75 | MIST |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 6.56 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 33.46 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 16.46 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 14.14 | DeepSets + Fourier features |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 5.09 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 31.97 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 14.69 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 14.94 | Fingerprint FFN |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 4.42 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 30.76 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 14.46 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 15.04 | DeepSets |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 1 | 3.06 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 20 | 27.74 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | Hit rate @ 5 | 11.35 | Random |
| Molecule retrieval from MS/MS spectrum (bonus chemical formulae) | MassSpecGym | MCES @ 1 | 13.87 | Random |
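To make the metric columns above more concrete, the following is a minimal sketch of how three of them are commonly computed: hit rate @ k, Tanimoto similarity, and cosine similarity between spectra. It is not the benchmark's own implementation. The function names, the fixed-width binning (`bin_width=1.0`), the representation of fingerprints as sets of on-bit indices, and the reporting of hit rate as a percentage are all assumptions made for illustration; the MassSpecGym repository should be consulted for the exact definitions.

```python
import math


def hit_rate_at_k(ranks, k):
    """Percentage of queries whose true molecule is ranked within the top k.

    `ranks` holds the 1-based rank of the correct candidate for each query
    (hypothetical input format, assumed for this sketch).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    hits = sum(1 for r in ranks if r <= k)
    return 100.0 * hits / len(ranks)


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (assumed binning scheme)."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two spectra given as (m/z, intensity) pairs."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs, invented purely for illustration.
    print(hit_rate_at_k([1, 3, 7, 25], k=5))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
    print(cosine_similarity([(100.2, 1.0), (150.7, 0.5)],
                            [(100.4, 1.0), (150.1, 0.5)]))
```

Under these definitions, higher is better for hit rate, Tanimoto, and cosine similarity, whereas MCES is a graph edit distance between molecules, so lower values indicate closer structures.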