Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, Matteo Manica
Recent advances in neural language models have been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, which creates a need for problem-specific fine-tuning and neglects task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, which complicates and limits human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in the physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Drug Discovery | ChEBI-20 | BLEU | 85.3 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Exact Match | 32.2 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.05 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Levenshtein | 16.87 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | MACCS FTS | 90.1 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Morgan FTS | 75.7 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Parameter Count | 220000000 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | RDK FTS | 81.6 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | Validity | 94.3 | Text+Chem T5-augm base |
| Drug Discovery | ChEBI-20 | BLEU | 81.5 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Exact Match | 19.1 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.06 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Levenshtein | 21.78 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | MACCS FTS | 86.4 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Morgan FTS | 67.2 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Parameter Count | 60000000 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | RDK FTS | 74.4 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | Validity | 95.1 | Text+Chem T5-augm small |
| Drug Discovery | ChEBI-20 | BLEU | 75 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Exact Match | 21.2 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.061 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Levenshtein | 27.39 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | MACCS FTS | 87.4 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Morgan FTS | 69.7 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Parameter Count | 220000000 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | RDK FTS | 76.7 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | Validity | 79.2 | Text+Chem T5 base |
| Drug Discovery | ChEBI-20 | BLEU | 73.9 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Exact Match | 15.7 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.066 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Levenshtein | 28.54 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | MACCS FTS | 85.9 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Morgan FTS | 66 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Parameter Count | 60000000 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | RDK FTS | 73.6 | Text+Chem T5 small |
| Drug Discovery | ChEBI-20 | Validity | 77.6 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 62.5 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 54.2 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | METEOR | 64.8 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 68.2 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 54.3 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 62.2 | Text+Chem T5-augm base |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 58 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 49 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | METEOR | 60.4 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 64.7 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 49.8 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 58.6 | Text+Chem T5 base |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 56 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 47 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | METEOR | 58.8 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 63.8 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 48.8 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 58 | Text+Chem T5-augm small |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 55.3 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 46.2 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | METEOR | 58.3 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 63.3 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 48.1 | Text+Chem T5 small |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 57.4 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 85.3 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 32.2 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.05 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 16.87 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 90.1 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 75.7 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 220000000 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 81.6 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 94.3 | Text+Chem T5-augm base |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 81.5 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 19.1 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.06 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 21.78 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 86.4 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 67.2 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 60000000 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 74.4 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 95.1 | Text+Chem T5-augm small |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 75 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 21.2 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.061 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 27.39 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 87.4 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 69.7 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 220000000 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 76.7 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 79.2 | Text+Chem T5 base |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 73.9 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 15.7 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.066 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 28.54 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 85.9 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 66 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 60000000 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 73.6 | Text+Chem T5 small |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 77.6 | Text+Chem T5 small |
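
For readers unfamiliar with the string-level metrics in the table, the sketch below shows one plausible way to compute Exact Match (as a percentage) and mean Levenshtein distance between generated and reference molecule strings. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the function names `levenshtein`, `exact_match_pct`, and `mean_levenshtein` are hypothetical, and the comparison is done on raw strings, whereas a real evaluation would likely canonicalize the molecules first.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete from `a`
                    current[j - 1] + 1,                   # insert into `a`
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def _check_pairs(predictions: Sequence[str], references: Sequence[str]) -> None:
    """Reject empty or misaligned inputs before averaging."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")


def exact_match_pct(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference string.

    Assumption: raw string equality, with no canonicalization step.
    """
    _check_pairs(predictions, references)
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def mean_levenshtein(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Average edit distance over aligned prediction/reference pairs."""
    _check_pairs(predictions, references)
    total = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return total / len(references)


if __name__ == "__main__":
    # Toy SMILES strings chosen for illustration only.
    preds = ["CCO", "CCN"]
    refs = ["CCO", "CCO"]
    print(exact_match_pct(preds, refs), mean_levenshtein(preds, refs))
```

Under these definitions, higher is better for Exact Match and lower is better for Levenshtein, which matches the direction of the values reported for the larger and augmented models in the table.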