CATH 4.3

The CATH (Class, Architecture, Topology, Homology) [65] database is a comprehensive resource for protein structure classification that hierarchically groups protein domains by their structural features. The database defines classes based on secondary structure content, architectures based on the spatial arrangement of secondary structure elements, topologies based on the connectivity of secondary structure elements, and homologous superfamilies based on evidence of common ancestry, such as sequence similarity. The curated dataset comprises a training set of 16,153 structures, a validation set of 1,457 structures, and a test set of 1,797 structures. Note that the curated CATH dataset contains only single-chain structures, so it does not cover the design of multi-chain proteins.
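To make the four-level hierarchy concrete, the following minimal sketch splits a CATH classification code of the form C.A.T.H into its levels and truncates it to a chosen depth. The names `CathCode`, `parse_cath_code`, and `prefix` are hypothetical helpers introduced for illustration only; they are not part of any CATH tooling and do not describe how the dataset above was built.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One CATH classification code, e.g. 3.40.50.300 (hypothetical helper)."""

    class_id: int      # C: class
    architecture: int  # A: architecture within the class
    topology: int      # T: topology (fold) within the architecture
    superfamily: int   # H: homologous superfamily within the topology

    def prefix(self, depth: int) -> str:
        """Return the dotted code truncated to the first `depth` levels (1-4)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(part) for part in self[:depth])


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted C.A.T.H string into its four integer levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))


if __name__ == "__main__":
    code = parse_cath_code("3.40.50.300")
    print(code.prefix(1))  # class level
    print(code.prefix(3))  # topology level
```

Two domains share a topology exactly when their depth-3 prefixes match, so a helper like `prefix(3)` is one way to group structures at the topology level, for example when checking how a split distributes folds across subsets.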