Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, Rui Yan

2023-10-11 | Drug Discovery | Text-based de novo Molecule Generation | Molecule Captioning

Abstract

Recent advancements in biological research leverage the integration of molecules, proteins, and natural language to enhance drug discovery. However, current models exhibit several limitations, such as the generation of invalid molecular SMILES, underutilization of contextual information, and equal treatment of structured and unstructured knowledge. To address these issues, we propose BioT5, a comprehensive pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations. BioT5 utilizes SELFIES for 100% robust molecular representations and extracts knowledge from the surrounding context of bio-entities in unstructured biological literature. Furthermore, BioT5 distinguishes between structured and unstructured knowledge, leading to more effective utilization of information. After fine-tuning, BioT5 shows superior performance across a wide range of tasks, demonstrating its strong capability of capturing underlying relations and properties of bio-entities. Our code is available at https://github.com/QizhiPei/BioT5.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Drug Discovery | ChEBI-20 | BLEU | 86.7 | BioT5 |
| Drug Discovery | ChEBI-20 | Exact Match | 41.3 | BioT5 |
| Drug Discovery | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.43 | BioT5 |
| Drug Discovery | ChEBI-20 | Levenshtein | 15.097 | BioT5 |
| Drug Discovery | ChEBI-20 | MACCS FTS | 88.6 | BioT5 |
| Drug Discovery | ChEBI-20 | Morgan FTS | 73.4 | BioT5 |
| Drug Discovery | ChEBI-20 | Parameter Count | 252000000 | BioT5 |
| Drug Discovery | ChEBI-20 | RDK FTS | 80.1 | BioT5 |
| Drug Discovery | ChEBI-20 | Text2Mol | 57.6 | BioT5 |
| Drug Discovery | ChEBI-20 | Validity | 100 | BioT5 |
| Molecule Captioning | ChEBI-20 | BLEU-2 | 63.5 | BioT5 |
| Molecule Captioning | ChEBI-20 | BLEU-4 | 55.6 | BioT5 |
| Molecule Captioning | ChEBI-20 | METEOR | 65.6 | BioT5 |
| Molecule Captioning | ChEBI-20 | ROUGE-1 | 69.2 | BioT5 |
| Molecule Captioning | ChEBI-20 | ROUGE-2 | 55.9 | BioT5 |
| Molecule Captioning | ChEBI-20 | ROUGE-L | 63.3 | BioT5 |
| Molecule Captioning | ChEBI-20 | Text2Mol | 60.3 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | BLEU | 86.7 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Exact Match | 41.3 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Frechet ChemNet Distance (FCD) | 0.43 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Levenshtein | 15.097 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | MACCS FTS | 88.6 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Morgan FTS | 73.4 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Parameter Count | 252000000 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | RDK FTS | 80.1 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Text2Mol | 57.6 | BioT5 |
| Text-based de novo Molecule Generation | ChEBI-20 | Validity | 100 | BioT5 |
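
For readers unfamiliar with the string-level and fingerprint-level metrics in the table, the following is a minimal sketch of how Exact Match, Levenshtein distance, and a Tanimoto-style fingerprint similarity (which the FTS rows are assumed to denote) can be computed. It is not the benchmark's evaluation script: the function names are hypothetical, molecule strings are assumed to be already canonicalized, and fingerprints are represented as plain sets of "on" bit indices rather than being derived with a cheminformatics toolkit.

```python
from typing import Sequence, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance where each insertion, deletion, or substitution costs 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def exact_match_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference string.

    Assumption: both sequences hold already-canonicalized molecule strings.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(bits_a & bits_b) / len(union)


if __name__ == "__main__":
    preds = ["CCO", "CC"]
    refs = ["CCO", "CN"]
    print(exact_match_rate(preds, refs))
    print(sum(levenshtein(p, r) for p, r in zip(preds, refs)) / len(preds))
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

Under this reading, lower is better for Levenshtein (and FCD), while higher is better for Exact Match and the similarity scores. Whether the reported values are averaged per sample, and how strings are canonicalized beforehand, depends on the benchmark's own evaluation code.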

Related Papers

Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)
A Graph-in-Graph Learning Framework for Drug-Target Interaction Prediction (2025-07-15)
Graph Learning (2025-07-08)
Exploring Modularity of Agentic Systems for Drug Discovery (2025-06-27)
Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design (2025-06-26)
Large Language Model Agent for Modular Task Execution in Drug Discovery (2025-06-26)
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning (2025-06-24)
A standard transformer and attention with linear biases for molecular conformer generation (2025-06-24)