Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction

Zhengkai Tu, Connor W. Coley

Published: 2021-10-19
Tasks: Machine Translation · Text Generation · Graph-to-Sequence · Data Augmentation · Translation · Retrosynthesis · Single-step Retrosynthesis
Links: Paper · PDF · Code (official)

Abstract

Synthesis planning and reaction outcome prediction are two fundamental problems in computer-aided organic chemistry for which a variety of data-driven approaches have emerged. Natural language approaches that model each problem as a SMILES-to-SMILES translation lead to a simple end-to-end formulation, reduce the need for data preprocessing, and enable the use of well-optimized machine translation model architectures. However, SMILES strings are not an efficient representation for capturing information about molecular structures, as evidenced by the success of SMILES augmentation in boosting empirical performance. Here, we describe a novel Graph2SMILES model that combines the power of Transformer models for text generation with the permutation invariance of molecular graph encoders, which mitigates the need for input data augmentation. As an end-to-end architecture, Graph2SMILES can be used as a drop-in replacement for the Transformer in any task involving molecule(s)-to-molecule(s) transformations. In our encoder, an attention-augmented directed message passing neural network (D-MPNN) captures local chemical environments, and a global attention encoder allows for long-range and intermolecular interactions, enhanced by graph-aware positional embeddings. Graph2SMILES improves the top-1 accuracy of the Transformer baselines by $1.7\%$ and $1.9\%$ for reaction outcome prediction on the USPTO_480k and USPTO_STEREO datasets, respectively, and by $9.8\%$ for one-step retrosynthesis on the USPTO_50k dataset.
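The encoder described in the abstract has two stages: a directed message passing network over bonds that captures local chemical environments, and a global self-attention layer over the resulting atom embeddings. The following is a minimal NumPy sketch of that two-stage idea on a toy graph, not the authors' implementation; all dimensions, weights, and the `dmpnn_encode`/`global_attention` names are illustrative assumptions (the real model adds graph-aware positional embeddings and a Transformer decoder).

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # toy hidden size, chosen arbitrarily for this sketch

def dmpnn_encode(atom_feats, bonds, depth=3):
    """Toy D-MPNN-style encoder: messages live on DIRECTED bonds and are
    iteratively refined, excluding the reverse bond to avoid echo."""
    W = rng.normal(scale=0.1, size=(atom_feats.shape[1], HIDDEN))
    msgs = {e: np.tanh(atom_feats[e[0]] @ W) for e in bonds}
    for _ in range(depth):
        new = {}
        for (u, v) in bonds:
            # sum messages arriving at u, excluding the reverse bond v -> u
            inc = sum((msgs[(w, x)] for (w, x) in bonds if x == u and w != v),
                      np.zeros(HIDDEN))
            new[(u, v)] = np.tanh(atom_feats[u] @ W + inc)
        msgs = new
    # atom embedding = sum of incoming directed-bond messages
    atom_emb = np.zeros((atom_feats.shape[0], HIDDEN))
    for (u, v), m in msgs.items():
        atom_emb[v] += m
    return atom_emb

def global_attention(x):
    """Single-head self-attention over atom embeddings, standing in for the
    global attention encoder (long-range / intermolecular interactions)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

# toy 3-atom chain A-B-C: 4-dim atom features, directed bonds both ways
atoms = rng.normal(size=(3, 4))
bonds = [(0, 1), (1, 0), (1, 2), (2, 1)]
h = global_attention(dmpnn_encode(atoms, bonds))
print(h.shape)  # (3, HIDDEN): one embedding per atom
```

Because every step aggregates over neighbors with order-independent sums (and attention weights depend only on embeddings, not positions), relabeling the atoms only permutes the output rows, which is the permutation invariance the paper relies on to avoid SMILES augmentation.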

Results

Task | Dataset | Metric | Value | Model
Single-step retrosynthesis | USPTO-50k | Top-1 accuracy | 52.9 | Graph2SMILES-D-GCN (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-3 accuracy | 66.5 | Graph2SMILES-D-GCN (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-5 accuracy | 70.0 | Graph2SMILES-D-GCN (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-10 accuracy | 72.9 | Graph2SMILES-D-GCN (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-1 accuracy | 51.2 | Graph2SMILES-D-GAT (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-3 accuracy | 66.3 | Graph2SMILES-D-GAT (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-5 accuracy | 70.4 | Graph2SMILES-D-GAT (reaction class unknown)
Single-step retrosynthesis | USPTO-50k | Top-10 accuracy | 73.9 | Graph2SMILES-D-GAT (reaction class unknown)
