Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Translation between Molecules and Natural Language

Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji

2022-04-25 | Self-Supervised Learning | Drug Discovery | Translation | Text-based de novo Molecule Generation | Molecule Captioning

Abstract

We present MolT5, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. MolT5 allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Since MolT5 pretrains models on single-modal data, it helps overcome the chemistry domain shortcoming of data scarcity. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that MolT5-based models are able to generate outputs, both molecules and captions, which in many cases are high quality.
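The abstract does not spell out the pretraining objective. As a rough illustration of what self-supervised pretraining on single-modal strings can look like, the sketch below applies a T5-style span-masking scheme to a molecule string. The choice of objective, the character-level tokenization, the `corrupt_spans` helper, and the fixed span positions are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, end) span with a sentinel token.

    Returns (inputs, targets) in the style of T5 span corruption:
    the input keeps the unmasked tokens plus one sentinel per span,
    and the target lists each sentinel followed by the tokens it hides.
    Spans are assumed to be non-overlapping, with end exclusive.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # copy visible tokens
        inputs.append(sentinel)              # mark the hidden span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # tokens the model must recover
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


# Hypothetical character-level tokens for a SMILES string (aspirin).
smiles_tokens = list("CC(=O)OC1=CC=CC=C1C(=O)O")
model_input, model_target = corrupt_spans(smiles_tokens, [(2, 6), (12, 15)])

print("".join(model_input))   # CC<extra_id_0>OC1=CC<extra_id_1>=C1C(=O)O
print("".join(model_target))  # <extra_id_0>(=O)<extra_id_1>=CC<extra_id_2>
```

The same helper works on a list of words from a natural language sentence, which is the sense in which neither modality needs paired labels under this kind of objective.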

Results

Molecule Captioning on ChEBI-20

| Model | BLEU-2 | BLEU-4 | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L | Text2Mol |
|---|---|---|---|---|---|---|---|
| MolT5-Large | 59.4 | 50.8 | 61.4 | 65.4 | 51 | 59.4 | 58.2 |
| MolT5-Base | 54 | 45.7 | 56.9 | 63.4 | 48.5 | 57.8 | 54.7 |
| MolT5-Small | 51.9 | 43.6 | 55.1 | 62 | 46.9 | 56.3 | 54 |

Molecule Captioning on L+M-24

| Model | BLEU-2 | BLEU-4 | METEOR | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|---|
| MolT5-Large | 76.9 | 55.6 | 74.3 | 77.7 | 58 | 55.7 |
| MolT5-Base | 73.8 | 53.5 | 71.8 | 75 | 55.9 | 53.9 |
| MolT5-Small | 70.9 | 51.2 | 70.1 | 74.5 | 55.8 | 54.4 |

Text-based de novo Molecule Generation on ChEBI-20 (the same entries are also listed under the Drug Discovery task)

| Model | Parameter Count | BLEU | Exact Match | Levenshtein | MACCS FTS | RDK FTS | Morgan FTS | Frechet ChemNet Distance (FCD) | Text2Mol | Validity |
|---|---|---|---|---|---|---|---|---|---|---|
| MolT5-Large | 770M | 85.4 | 30.2 | 16.07 | 83.4 | 74.6 | 68.4 | 1.2 | 55.4 | 90.5 |
| MolT5-Large-HV | 770M | 81 | 31.4 | 16.758 | 87.2 | 78.6 | 72.2 | 0.44 | 59 | 99.6 |
| MolT5-base | 220M | 76.9 | 8.1 | 24.458 | 72.1 | 58.8 | 52.9 | 2.18 | 49.6 | 77.2 |
| MolT5-small | 60M | 75.5 | 7.9 | 25.988 | 70.3 | 56.8 | 51.7 | 2.49 | 48.2 | 72.1 |
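
Several of the molecule generation metrics reported above (Exact Match, Levenshtein, and the fingerprint Tanimoto similarity, FTS, scores) compare a generated molecule string against a reference. The following is a minimal sketch of how such per-example scores could be computed in plain Python. It is not the paper's evaluation code: the function names are hypothetical, exact match is done on raw strings (a real evaluation would likely canonicalize molecules with a cheminformatics toolkit first), and the fingerprints are represented as hand-written sets of on-bit indices rather than real MACCS, RDK, or Morgan fingerprints.

```python
from typing import Set


def exact_match(pred: str, ref: str) -> bool:
    """String-level exact match (assumes both strings are already canonical)."""
    return pred == ref


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as on-bit sets."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # arbitrary convention for two empty fingerprints
    return len(bits_a & bits_b) / union


# Toy inputs: ethanol vs. ethylamine as SMILES, plus made-up fingerprint bits.
print(exact_match("CCO", "CCO"))         # True
print(levenshtein("CCO", "CCN"))         # 1
print(tanimoto({1, 2, 3}, {2, 3, 4}))    # 0.5
```

Averaging such per-example scores over a test set would give dataset-level numbers; the scale used on this page (for example, FTS values shown as percentages) is a reporting choice the sketch does not reproduce.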
