Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan
Recent research in computational biology has increasingly focused on integrating text and bio-entity modeling, especially for molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities and substantially improving grounded reasoning over bio-text and bio-sequences. The model is pre-trained and fine-tuned, then evaluated in extensive experiments covering 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, achieving state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.
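
The abstract names a numerical tokenization technique but does not describe it. As a rough illustration only, the sketch below assumes a digit-level scheme in which every number is split into single characters, so that values such as `-2.45` and `-2.46` share most of their tokens. The function name `split_numbers` and the regular expression are hypothetical and are not taken from the BioT5+ code.

```python
import re

# Assumption: a "number" is an optionally signed decimal such as 7, -2.45, 100.
NUMBER = r"-?\d+(?:\.\d+)?"


def split_numbers(text: str) -> list[str]:
    """Split text on whitespace, but emit one token per character of each number."""
    tokens = []
    # Try to match a number first; otherwise take a run of non-space characters.
    for piece in re.findall(NUMBER + r"|\S+", text):
        if re.fullmatch(NUMBER, piece):
            tokens.extend(piece)  # one token per digit, sign, or decimal point
        else:
            tokens.append(piece)
    return tokens


print(split_numbers("logP is -2.45"))
```

With this scheme the model sees each digit in a consistent position instead of an arbitrary subword split, which is one plausible reason such tokenization can help on regression-style targets.
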
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Drug Discovery | ChEBI-20 | BLEU | 87.2 | BioT5+ |
| Drug Discovery | ChEBI-20 | Exact Match | 52.2 | BioT5+ |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.353 | BioT5+ |
| Drug Discovery | ChEBI-20 | Levenshtein | 12.776 | BioT5+ |
| Drug Discovery | ChEBI-20 | MACCS FTS | 90.7 | BioT5+ |
| Drug Discovery | ChEBI-20 | Morgan FTS | 77.9 | BioT5+ |
| Drug Discovery | ChEBI-20 | Parameter Count | 252000000 | BioT5+ |
| Drug Discovery | ChEBI-20 | RDK FTS | 83.5 | BioT5+ |
| Drug Discovery | ChEBI-20 | Text2Mol | 57.9 | BioT5+ |
| Drug Discovery | ChEBI-20 | Validity | 100 | BioT5+ |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 66.6 | BioT5+ |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 59.1 | BioT5+ |
| Molecule Captioning | ChEBI-20 | METEOR | 68.1 | BioT5+ |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 71 | BioT5+ |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 58.4 | BioT5+ |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 65 | BioT5+ |
| Forward reaction prediction | Mol-Instruction | Exact | 0.864 | BioT5+ |
| Forward reaction prediction | Mol-Instruction | Morgan FTS | 0.935 | BioT5+ |
| Forward reaction prediction | Mol-Instruction | Validity | 1 | BioT5+ |
| Reagent Prediction | Mol-Instruction | Exact | 0.257 | BioT5+ |
| Reagent Prediction | Mol-Instruction | Morgan FTS | 0.512 | BioT5+ |
| Reagent Prediction | Mol-Instruction | Validity | 1 | BioT5+ |
| Retrosynthesis | Mol-Instruction | Exact | 0.642 | BioT5+ |
| Retrosynthesis | Mol-Instruction | Morgan FTS | 0.866 | BioT5+ |
| Retrosynthesis | Mol-Instruction | Validity | 1 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 87.2 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 52.2 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.353 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 12.776 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 90.7 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 77.9 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 252000000 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 83.5 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 57.9 | BioT5+ |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 100 | BioT5+ |
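
Two of the string-level metrics in the table, Exact Match and Levenshtein, can be sketched in a few lines. The code below is a minimal illustration, assuming predictions and references are plain strings compared character by character and that scores are averaged over the dataset. Published evaluation scripts may canonicalize molecules before comparing them, so this is not a reproduction of the reported numbers; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def evaluate(predictions: list[str], references: list[str]) -> dict[str, float]:
    """Average exact-match rate and Levenshtein distance over paired strings."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    n = len(references)
    exact = sum(p == r for p, r in zip(predictions, references)) / n
    distance = sum(levenshtein(p, r) for p, r in zip(predictions, references)) / n
    return {"exact_match": exact, "levenshtein": distance}


print(evaluate(["CCO", "CC"], ["CCO", "CCN"]))
```

Higher is better for exact match and lower is better for Levenshtein. Note that the table mixes scales: the ChEBI-20 rows report exact match as a percentage, while the Mol-Instruction rows use fractions between 0 and 1.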