Vis-CheBI20

Modalities: Images, Texts
Introduced: 2025-01-26

Molecules are the tokens of the language of chemistry, which underlies not only chemistry itself but also scientific fields that rely on chemical information, such as pharmacy, materials science, and molecular biology. Existing molecular information is distributed across textbooks, publications, and patents. To convey structural information (the spatial arrangement of atoms), these documents commonly draw molecules as 2D images, so Optical Chemical Structure Understanding (OCSU) plays an important role in molecule-centric scientific discovery.

OCSU aims to automatically translate chemical structure diagrams into chemist-readable or machine-readable strings that describe the molecule at the motif level, the molecule level, and the abstract level. It typically comprises four subtasks: functional group caption, molecular description, chemist-readable IUPAC naming, and machine-readable SMILES naming (OCSR). With these, molecular structural information can be fully extracted to support downstream tasks such as molecule-centric chat, property prediction, and molecule editing.
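To make the four subtasks concrete, here is a minimal sketch of what one sample's targets might look like, using ethanol as the molecule. The record type `OCSUSample`, its field names, the task keys, and the file name `ethanol.png` are all hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OCSUSample:
    """Hypothetical record: one molecule image plus its four OCSU targets."""
    image_path: str               # 2D structure diagram (assumed file name)
    functional_groups: List[str]  # motif level: functional group caption
    description: str              # abstract level: molecular description
    iupac_name: str               # molecule level, chemist-readable
    smiles: str                   # molecule level, machine-readable (OCSR)


def subtask_targets(sample: OCSUSample) -> Dict[str, str]:
    """Map each of the four subtasks to its target string for this sample."""
    return {
        "functional_group_caption": ", ".join(sample.functional_groups),
        "molecular_description": sample.description,
        "iupac_naming": sample.iupac_name,
        "smiles_naming": sample.smiles,
    }


# Ethanol: a hydroxyl group on a two-carbon chain.
ethanol = OCSUSample(
    image_path="ethanol.png",
    functional_groups=["hydroxyl"],
    description="Ethanol is a primary alcohol with a two-carbon chain.",
    iupac_name="ethanol",
    smiles="CCO",
)

targets = subtask_targets(ethanol)
for task, target in targets.items():
    print(f"{task}: {target}")
```

All four targets derive from the same input image, which is why the subtasks can be treated as different views of one structure rather than as unrelated problems.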

Vis-CheBI20 is released for the Optical Chemical Structure Understanding (OCSU) task and is built on the CheBI-20 molecule description dataset.