MolFM: A Multimodal Molecular Foundation Model

Yizhen Luo, Kai Yang, Massimo Hong, Xing Yi Liu, Zaiqing Nie

2023-06-06

Tasks: Cross-Modal Retrieval, Knowledge Graphs, Representation Learning, Text-based de novo Molecule Generation, Retrieval, Molecule Captioning

Abstract

Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. MolFM achieves state-of-the-art performance on various downstream tasks. On cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04% absolute gains under the zero-shot and fine-tuning settings, respectively. Furthermore, qualitative analysis showcases MolFM's implicit ability to provide grounding from molecular substructures and knowledge graphs. Code and models are available on https://github.com/BioFM/OpenBioMed.
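
The abstract describes pre-training as minimizing the feature-space distance between different modalities of the same molecule. The sketch below illustrates that general idea with a generic symmetric contrastive (InfoNCE-style) loss over paired structure and text embeddings. It is an illustration only, not the MolFM training objective: the function names, the temperature value, and the toy embeddings are all assumptions introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def contrastive_loss(struct_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss for a batch of paired embeddings.

    struct_embs[i] and text_embs[i] are assumed to describe the same
    molecule; every other pairing in the batch is treated as a negative.
    The temperature of 0.1 is an arbitrary illustrative value.
    """
    n = len(struct_embs)
    sims = [[cosine(s, t) / temperature for t in text_embs] for s in struct_embs]
    loss = 0.0
    for i in range(n):
        # Structure-to-text direction: the positive is column i of row i.
        row = sims[i]
        loss += log_sum_exp(row) - row[i]
        # Text-to-structure direction: the positive is row i of column i.
        col = [sims[j][i] for j in range(n)]
        loss += log_sum_exp(col) - col[i]
    return loss / (2 * n)


# Toy embeddings (hypothetical): two molecules in a 2-dimensional space.
structures = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(structures, aligned_texts)
shuffled_loss = contrastive_loss(structures, shuffled_texts)
```

In this toy setup the loss is small when each structure embedding lies close to its own text embedding and large when the pairing is shuffled, which is the behaviour a distance-minimizing objective of this kind is meant to reward.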

Results

Task	Dataset	Metric	Value	Model
Drug Discovery	ChEBI-20	BLEU	82.2	MolFM-Base
Drug Discovery	ChEBI-20	Exact Match	21	MolFM-Base
Drug Discovery	ChEBI-20	Levenshtein	19.445	MolFM-Base
Drug Discovery	ChEBI-20	MACCS FTS	85.4	MolFM-Base
Drug Discovery	ChEBI-20	Morgan FTS	75.8	MolFM-Base
Drug Discovery	ChEBI-20	Parameter Count	296200000	MolFM-Base
Drug Discovery	ChEBI-20	RDK FTS	69.7	MolFM-Base
Drug Discovery	ChEBI-20	Text2Mol	58.3	MolFM-Base
Drug Discovery	ChEBI-20	Validity	89.2	MolFM-Base
Drug Discovery	ChEBI-20	BLEU	80.3	MolFM-Small
Drug Discovery	ChEBI-20	Exact Match	16.9	MolFM-Small
Drug Discovery	ChEBI-20	Levenshtein	20.868	MolFM-Small
Drug Discovery	ChEBI-20	MACCS FTS	83.4	MolFM-Small
Drug Discovery	ChEBI-20	Morgan FTS	72.1	MolFM-Small
Drug Discovery	ChEBI-20	Parameter Count	13620000	MolFM-Small
Drug Discovery	ChEBI-20	RDK FTS	66.2	MolFM-Small
Drug Discovery	ChEBI-20	Text2Mol	57.3	MolFM-Small
Drug Discovery	ChEBI-20	Validity	85.9	MolFM-Small
Molecule Captioning	ChEBI-20	BLEU-2	58.5	MolFM-Base
Molecule Captioning	ChEBI-20	BLEU-4	49.8	MolFM-Base
Molecule Captioning	ChEBI-20	METEOR	60.7	MolFM-Base
Molecule Captioning	ChEBI-20	ROUGE-1	65.3	MolFM-Base
Molecule Captioning	ChEBI-20	ROUGE-2	50.8	MolFM-Base
Molecule Captioning	ChEBI-20	ROUGE-L	59.4	MolFM-Base
Molecule Captioning	ChEBI-20	Text2Mol	57.6	MolFM-Base
Molecule Captioning	ChEBI-20	BLEU-2	54.2	MolFM-Small
Molecule Captioning	ChEBI-20	BLEU-4	45.2	MolFM-Small
Molecule Captioning	ChEBI-20	METEOR	56.4	MolFM-Small
Molecule Captioning	ChEBI-20	ROUGE-1	62.3	MolFM-Small
Molecule Captioning	ChEBI-20	ROUGE-2	46.9	MolFM-Small
Molecule Captioning	ChEBI-20	ROUGE-L	56.2	MolFM-Small
Molecule Captioning	ChEBI-20	Text2Mol	55.7	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	BLEU	82.2	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Exact Match	21	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Levenshtein	19.445	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	MACCS FTS	85.4	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Morgan FTS	75.8	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Parameter Count	296200000	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	RDK FTS	69.7	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Text2Mol	58.3	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	Validity	89.2	MolFM-Base
Text-based de novo Molecule Generation	ChEBI-20	BLEU	80.3	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Exact Match	16.9	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Levenshtein	20.868	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	MACCS FTS	83.4	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Morgan FTS	72.1	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Parameter Count	13620000	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	RDK FTS	66.2	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Text2Mol	57.3	MolFM-Small
Text-based de novo Molecule Generation	ChEBI-20	Validity	85.9	MolFM-Small
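
Two of the metric families in the table can be illustrated with a short sketch: Levenshtein is an edit distance between generated and reference molecule strings, and the FTS rows are assumed here to denote fingerprint Tanimoto similarity, as is common for this benchmark. The code below is a minimal standard-library illustration, not the evaluation script behind the reported numbers. The fingerprints are hypothetical sets of on-bit indices; real MACCS, RDK, or Morgan fingerprints would come from a cheminformatics toolkit.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Example strings: ethanol "CCO" versus ethylamine "CCN" in SMILES notation.
distance = levenshtein("CCO", "CCN")

# Hypothetical fingerprints given as on-bit indices.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

A lower Levenshtein value means the generated string is closer to the reference, whereas a higher Tanimoto value means the two fingerprints share more bits, so the two metric families point in opposite directions when comparing MolFM-Base with MolFM-Small.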
