MIPE
Improving Paratope and Epitope Prediction by Multi-Modal Contrastive Learning and Interaction Informativeness Estimation
Datasets. From the publicly accessible Structural Antibody Database (SAbDab), we collected a total of 7571 antibody-antigen complexes, with sequence data in FASTA format and structural data in PDB format. Following previous studies [Pittala and Bailey-Kellogg, 2020], we used CD-HIT [Li and Godzik, 2006] to remove high-homology antibody and antigen sequences, with thresholds of 95% and 90% sequence identity, respectively. Subsequently, we excluded antibodies and antigens containing any residue type other than the 20 naturally occurring amino acid types. Finally, we compiled a dataset consisting of 626 binding antibody-antigen pairs, including their sequences, structures, and corresponding interaction maps. Notably, antibodies primarily bind to antigens through their CDR regions. Most researchers use Euclidean distance to define paratopes and epitopes, and we follow this convention in our dataset: a residue in the CDR regions is labeled as a paratope residue if the Euclidean distance between its backbone atom and any backbone atom of the antigen is less than 4.5 Å, and a residue in the antigen is labeled as an epitope residue if the distance between its backbone atom and any backbone atom of the CDR regions is less than 4.5 Å.
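The distance-based labeling rule can be illustrated with a minimal sketch. It is not the authors' code: the function name `label_interface`, its argument layout, and the reading that a residue counts as in contact when any one of its backbone atoms lies within the cutoff are assumptions made for illustration. Coordinates are assumed to have been extracted from the PDB files beforehand, with one residue index per backbone atom.

```python
import numpy as np

def label_interface(ab_xyz, ab_res, ag_xyz, ag_res, n_ab, n_ag, cutoff=4.5):
    """Hypothetical helper: label paratope/epitope residues by distance.

    ab_xyz : (A, 3) backbone-atom coordinates of CDR residues, in angstroms
    ab_res : (A,)   residue index (0..n_ab-1) of each CDR backbone atom
    ag_xyz : (G, 3) backbone-atom coordinates of antigen residues
    ag_res : (G,)   residue index (0..n_ag-1) of each antigen backbone atom
    cutoff : distance threshold in angstroms (4.5 in the text above)
    """
    ab_xyz = np.asarray(ab_xyz, dtype=float)
    ag_xyz = np.asarray(ag_xyz, dtype=float)
    ab_res = np.asarray(ab_res, dtype=int)
    ag_res = np.asarray(ag_res, dtype=int)

    # Pairwise Euclidean distances between all atom pairs, shape (A, G).
    dist = np.linalg.norm(ab_xyz[:, None, :] - ag_xyz[None, :, :], axis=-1)

    # Atom pairs strictly closer than the cutoff.
    atom_i, atom_j = np.nonzero(dist < cutoff)

    # Residue-level interaction map: True where any atom pair is in contact
    # (assumption: "any backbone atom" on both sides).
    contact = np.zeros((n_ab, n_ag), dtype=bool)
    contact[ab_res[atom_i], ag_res[atom_j]] = True

    paratope = contact.any(axis=1)  # CDR residues touching the antigen
    epitope = contact.any(axis=0)   # antigen residues touching the CDRs
    return paratope, epitope, contact

# Toy example: two CDR residues and two antigen residues, one atom each.
paratope, epitope, contact = label_interface(
    ab_xyz=[[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]], ab_res=[0, 1],
    ag_xyz=[[0.0, 0.0, 4.0], [20.0, 0.0, 0.0]], ag_res=[0, 1],
    n_ab=2, n_ag=2,
)
```

In the toy example only the first CDR residue and the first antigen residue are 4.0 Å apart, so they are the only pair flagged in the interaction map. The dense distance matrix is adequate for a single complex; a spatial index such as a k-d tree would be a reasonable substitute for very large antigens.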