Haisong Gong, Qiang Liu, Shu Wu, Liang Wang
Text-guided molecule generation is the task of generating molecules that match specific textual descriptions. Most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates the token embeddings of the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise, guided by the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.
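To make the phrase "collectively and iteratively" concrete, the following is a minimal toy sketch, not the authors' implementation. It starts from Gaussian noise and refines the embeddings of all positions at once over several steps, then rounds each embedding to the nearest token. Everything here is an assumption for illustration: the 2-D embedding table `EMBED`, the function names, and in particular `toy_denoise`, which stands in for a trained text-conditioned network by simply pulling toward known clean embeddings. The second, correction phase is not modelled.

```python
import random

# Hypothetical 2-D token embeddings; a real model learns high-dimensional ones.
EMBED = {"C": (1.0, 0.0), "O": (0.0, 1.0), "N": (-1.0, 0.0), "=": (0.0, -1.0)}


def toy_denoise(x, clean, rate=0.3):
    """Stand-in for the trained network (an assumption, not the paper's model):
    nudge EVERY position toward the clean embeddings in one collective update."""
    return [
        tuple(v + rate * (c - v) for v, c in zip(pos, tgt))
        for pos, tgt in zip(x, clean)
    ]


def round_to_tokens(x):
    """Map each embedding to the nearest vocabulary token (squared distance)."""
    def nearest(pos):
        return min(
            EMBED,
            key=lambda t: sum((a - b) ** 2 for a, b in zip(pos, EMBED[t])),
        )
    return "".join(nearest(pos) for pos in x)


def generate(guidance_smiles, steps=20, seed=0):
    """Toy generation loop: noise -> iterative whole-sequence refinement -> tokens.
    `guidance_smiles` is a toy proxy for the text description."""
    rng = random.Random(seed)
    clean = [EMBED[ch] for ch in guidance_smiles]
    # Start from random noise at every position, unlike left-to-right decoding.
    x = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in clean]
    for _ in range(steps):
        x = toy_denoise(x, clean)
    return round_to_tokens(x)


# Usage: all four positions are refined together at every step.
print(generate("CC=O"))
```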
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Drug Discovery | ChEBI-20 | BLEU | 82.8 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Exact Match | 24.2 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.89 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Levenshtein | 16.897 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | MACCS FTS | 87.4 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Morgan FTS | 72.2 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Parameter Count | 180000000 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | RDK FTS | 77.1 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Text2Mol | 58.9 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | Validity | 78.9 | TGM-DLM w/o corr |
| Drug Discovery | ChEBI-20 | BLEU | 82.6 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Exact Match | 24.2 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.77 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Levenshtein | 17.003 | TGM-DLM |
| Drug Discovery | ChEBI-20 | MACCS FTS | 85.4 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Morgan FTS | 68.8 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Parameter Count | 180000000 | TGM-DLM |
| Drug Discovery | ChEBI-20 | RDK FTS | 73.9 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Text2Mol | 58.1 | TGM-DLM |
| Drug Discovery | ChEBI-20 | Validity | 87.1 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 82.8 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 24.2 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.89 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 16.897 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 87.4 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 72.2 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 180000000 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 77.1 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 58.9 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 78.9 | TGM-DLM w/o corr |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 82.6 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 24.2 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.77 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 17.003 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 85.4 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 68.8 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 180000000 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 73.9 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 58.1 | TGM-DLM |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 87.1 | TGM-DLM |
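
The Levenshtein and Exact Match rows above are string-level comparisons between generated and reference SMILES. The sketch below shows one way such scores could be computed over a list of (generated, reference) pairs. It is an illustration under stated assumptions rather than the benchmark's evaluation script: the function names are hypothetical, strings are compared as-is without canonicalization, and the toy pairs are made up for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def string_metrics(pairs):
    """Return (mean Levenshtein distance, exact-match fraction) over pairs.
    Assumes raw string comparison; a real pipeline may canonicalize first."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    distances = [levenshtein(gen, ref) for gen, ref in pairs]
    exact = sum(gen == ref for gen, ref in pairs)
    return sum(distances) / len(pairs), exact / len(pairs)


# Made-up example pairs: (generated SMILES, reference SMILES).
pairs = [("CCO", "CCO"), ("CC=O", "CCO"), ("c1ccccc1", "C1CCCCC1")]
mean_dist, exact_frac = string_metrics(pairs)
print(mean_dist, exact_frac)
```

Lower is better for Levenshtein and higher is better for Exact Match. Validity and the fingerprint-based scores (MACCS, RDK, Morgan FTS) need a cheminformatics toolkit to parse the SMILES, so they are left out of this standard-library sketch.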