Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MolXPT: Wrapping Molecules with Text for Generative Pre-training

Zequn Liu, Wei Zhang, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Ming Zhang, Tie-Yan Liu

Published: 2023-05-18
Tasks: Molecular Property Prediction · Text-based de novo Molecule Generation · Language Modelling · Molecule Captioning
Links: Paper · PDF · Code (official)

Abstract

The generative pre-trained Transformer (GPT) has demonstrated great success in natural language processing, and related techniques have been adapted to molecular modeling. Considering that text is the most important record of scientific discovery, in this paper we propose MolXPT, a unified language model of text and molecules pre-trained on SMILES (a sequence representation of molecules) wrapped by text. Briefly, we detect the molecule names in each sequence and replace them with the corresponding SMILES. In this way, the SMILES can leverage information from the surrounding text, and vice versa. The wrapped sequences, text sequences from PubMed, and SMILES sequences from PubChem are all fed into a language model for pre-training. Experimental results demonstrate that MolXPT outperforms strong baselines for molecular property prediction on MoleculeNet, performs comparably to the best model in text-molecule translation while using less than half of its parameters, and enables zero-shot molecular generation without finetuning.
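The wrapping step described in the abstract (detect molecule names in text, substitute their SMILES) can be sketched as below. This is a minimal illustration, not the paper's pipeline: the name-to-SMILES dictionary and the whole-word regex matching stand in for MolXPT's actual named-entity recognition and PubChem linking.

```python
import re

# Hypothetical name->SMILES lookup. MolXPT links detected names to
# PubChem; a small dictionary is enough to show the substitution step.
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
}

def wrap_with_smiles(text: str) -> str:
    """Replace recognised molecule names with their SMILES strings,
    producing a single sequence that mixes text and molecules."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: NAME_TO_SMILES[m.group(1).lower()], text)

print(wrap_with_smiles("Aspirin inhibits COX enzymes."))
# -> CC(=O)OC1=CC=CC=C1C(=O)O inhibits COX enzymes.
```

The resulting wrapped sequences are what get fed into the language model alongside plain PubMed text and plain PubChem SMILES.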

Results

Task | Dataset | Metric | Value | Model
Drug Discovery | ChEBI-20 | Exact Match | 21.5 | MolXPT
Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.45 | MolXPT
Drug Discovery | ChEBI-20 | MACCS FTS | 85.9 | MolXPT
Drug Discovery | ChEBI-20 | Morgan FTS | 66.7 | MolXPT
Drug Discovery | ChEBI-20 | Parameter Count | 350000000 | MolXPT
Drug Discovery | ChEBI-20 | RDK FTS | 75.7 | MolXPT
Drug Discovery | ChEBI-20 | Text2Mol | 57.8 | MolXPT
Drug Discovery | ChEBI-20 | Validity | 98.3 | MolXPT
Molecular Property Prediction | HIV dataset | AUC | 0.781 | MolXPT
Molecular Property Prediction | SIDER | ROC-AUC | 71.7 | MolXPT
Molecular Property Prediction | Tox21 | ROC-AUC | 77.1 | MolXPT
Molecular Property Prediction | BACE | ROC-AUC | 88.4 | MolXPT
Molecule Captioning | ChEBI-20 | BLEU-2 | 59.4 | MolXPT
Molecule Captioning | ChEBI-20 | BLEU-4 | 50.5 | MolXPT
Molecule Captioning | ChEBI-20 | METEOR | 62.6 | MolXPT
Molecule Captioning | ChEBI-20 | ROUGE-1 | 66 | MolXPT
Molecule Captioning | ChEBI-20 | ROUGE-2 | 51.1 | MolXPT
Molecule Captioning | ChEBI-20 | ROUGE-L | 59.7 | MolXPT
Molecule Captioning | ChEBI-20 | Text2Mol | 59.4 | MolXPT
Atomistic Description | HIV dataset | AUC | 0.781 | MolXPT
Atomistic Description | SIDER | ROC-AUC | 71.7 | MolXPT
Atomistic Description | Tox21 | ROC-AUC | 77.1 | MolXPT
Atomistic Description | BACE | ROC-AUC | 88.4 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 21.5 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.45 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 85.9 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 66.7 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 350000000 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 75.7 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 57.8 | MolXPT
Text-based de novo Molecule Generation | ChEBI-20 | Validity | 98.3 | MolXPT
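The MACCS, Morgan, and RDK FTS rows above report fingerprint Tanimoto similarity (FTS) between generated and reference molecules. A minimal sketch of the Tanimoto computation over bit-vector fingerprints, using toy bit sets rather than real MACCS or Morgan fingerprints:

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto similarity over set-bit indices: |A & B| / |A | B|."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are conventionally identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy fingerprints standing in for MACCS/Morgan bit vectors.
generated = {1, 4, 7, 9}
reference = {1, 4, 9, 12}
print(tanimoto(generated, reference))  # 3 shared / 5 total = 0.6
```

In practice these fingerprints come from a cheminformatics toolkit such as RDKit, and the benchmark averages the similarity over all generated/reference pairs.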

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)