Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, Heng Ji
We present **MolT5**, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. **MolT5** enables new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (together: translation between molecules and language), which we explore for the first time. Because **MolT5** pretrains models on single-modal data, it helps overcome the data scarcity that limits work in the chemistry domain. Furthermore, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. Our results show that **MolT5**-based models are able to generate outputs, both molecules and captions, that are in many cases of high quality.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Drug Discovery | ChEBI-20 | BLEU | 85.4 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Exact Match | 30.2 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 1.2 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Levenshtein | 16.07 | MolT5-Large |
| Drug Discovery | ChEBI-20 | MACCS FTS | 83.4 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Morgan FTS | 68.4 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Parameter Count | 770000000 | MolT5-Large |
| Drug Discovery | ChEBI-20 | RDK FTS | 74.6 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Text2Mol | 55.4 | MolT5-Large |
| Drug Discovery | ChEBI-20 | Validity | 90.5 | MolT5-Large |
| Drug Discovery | ChEBI-20 | BLEU | 81 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Exact Match | 31.4 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.44 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Levenshtein | 16.758 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | MACCS FTS | 87.2 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Morgan FTS | 72.2 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Parameter Count | 770000000 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | RDK FTS | 78.6 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Text2Mol | 59 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | Validity | 99.6 | MolT5-Large-HV |
| Drug Discovery | ChEBI-20 | BLEU | 76.9 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Exact Match | 8.1 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 2.18 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Levenshtein | 24.458 | MolT5-Base |
| Drug Discovery | ChEBI-20 | MACCS FTS | 72.1 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Morgan FTS | 52.9 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Parameter Count | 220000000 | MolT5-Base |
| Drug Discovery | ChEBI-20 | RDK FTS | 58.8 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Text2Mol | 49.6 | MolT5-Base |
| Drug Discovery | ChEBI-20 | Validity | 77.2 | MolT5-Base |
| Drug Discovery | ChEBI-20 | BLEU | 75.5 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Exact Match | 7.9 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 2.49 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Levenshtein | 25.988 | MolT5-Small |
| Drug Discovery | ChEBI-20 | MACCS FTS | 70.3 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Morgan FTS | 51.7 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Parameter Count | 60000000 | MolT5-Small |
| Drug Discovery | ChEBI-20 | RDK FTS | 56.8 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Text2Mol | 48.2 | MolT5-Small |
| Drug Discovery | ChEBI-20 | Validity | 72.1 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 59.4 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 50.8 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | METEOR | 61.4 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 65.4 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 51 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 59.4 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | Text2Mol | 58.2 | MolT5-Large |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 54 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 45.7 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | METEOR | 56.9 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 63.4 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 48.5 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 57.8 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | Text2Mol | 54.7 | MolT5-Base |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 51.9 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 43.6 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | METEOR | 55.1 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 62 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 46.9 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 56.3 | MolT5-Small |
| Molecule Captioning | ChEBI-20 | Text2Mol | 54 | MolT5-Small |
| Molecule Captioning | L+M-24 | BLEU-2 | 76.9 | MolT5-Large |
| Molecule Captioning | L+M-24 | BLEU-4 | 55.6 | MolT5-Large |
| Molecule Captioning | L+M-24 | METEOR | 74.3 | MolT5-Large |
| Molecule Captioning | L+M-24 | ROUGE-1 | 77.7 | MolT5-Large |
| Molecule Captioning | L+M-24 | ROUGE-2 | 58 | MolT5-Large |
| Molecule Captioning | L+M-24 | ROUGE-L | 55.7 | MolT5-Large |
| Molecule Captioning | L+M-24 | BLEU-2 | 73.8 | MolT5-Base |
| Molecule Captioning | L+M-24 | BLEU-4 | 53.5 | MolT5-Base |
| Molecule Captioning | L+M-24 | METEOR | 71.8 | MolT5-Base |
| Molecule Captioning | L+M-24 | ROUGE-1 | 75 | MolT5-Base |
| Molecule Captioning | L+M-24 | ROUGE-2 | 55.9 | MolT5-Base |
| Molecule Captioning | L+M-24 | ROUGE-L | 53.9 | MolT5-Base |
| Molecule Captioning | L+M-24 | BLEU-2 | 70.9 | MolT5-Small |
| Molecule Captioning | L+M-24 | BLEU-4 | 51.2 | MolT5-Small |
| Molecule Captioning | L+M-24 | METEOR | 70.1 | MolT5-Small |
| Molecule Captioning | L+M-24 | ROUGE-1 | 74.5 | MolT5-Small |
| Molecule Captioning | L+M-24 | ROUGE-2 | 55.8 | MolT5-Small |
| Molecule Captioning | L+M-24 | ROUGE-L | 54.4 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 85.4 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 30.2 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 1.2 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 16.07 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 83.4 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 68.4 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 770000000 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 74.6 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 55.4 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 90.5 | MolT5-Large |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 81 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 31.4 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.44 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 16.758 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 87.2 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 72.2 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 770000000 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 78.6 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 59 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 99.6 | MolT5-Large-HV |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 76.9 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 8.1 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 2.18 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 24.458 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 72.1 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 52.9 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 220000000 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 58.8 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 49.6 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 77.2 | MolT5-Base |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 75.5 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 7.9 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 2.49 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 25.988 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 70.3 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 51.7 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 60000000 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 56.8 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 48.2 | MolT5-Small |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 72.1 | MolT5-Small |
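
To make the string-level generation metrics in the table more concrete, the following is a minimal sketch of how Exact Match, Levenshtein distance, and a Tanimoto similarity might be computed. It is not the authors' evaluation code. The function names, the toy SMILES pairs, and the bit sets are hypothetical; exact match is taken here as raw string equality (a real evaluation would likely compare canonicalized molecules); and the sketch assumes that "FTS" denotes a fingerprint Tanimoto similarity, with fingerprints represented as plain sets of on-bit indices rather than real MACCS, RDK, or Morgan fingerprints.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # assumption: two empty fingerprints score 0 in this sketch
    return len(bits_a & bits_b) / len(union)


def evaluate(pairs):
    """Average metrics over (reference_smiles, generated_smiles) pairs."""
    n = len(pairs)
    return {
        # Simplification: raw string equality, no canonicalization.
        "exact_match": sum(ref == gen for ref, gen in pairs) / n,
        "levenshtein": sum(levenshtein(ref, gen) for ref, gen in pairs) / n,
    }


if __name__ == "__main__":
    # Hypothetical toy data: ethanol vs. ethanol, ethanol vs. ethylamine.
    toy_pairs = [("CCO", "CCO"), ("CCO", "CCN")]
    print(evaluate(toy_pairs))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

In this sketch a lower Levenshtein value means the generated string is closer to the reference, while higher Exact Match and Tanimoto values indicate closer agreement, which matches the direction in which the table's values improve from the smaller to the larger models.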