Description
CuBERT, or Code Understanding BERT, is a BERT-based model for source-code understanding. To pre-train it, the authors curate a massive corpus of Python programs collected from GitHub. Because GitHub projects are known to contain a large amount of duplicated code, and training on such duplicates would bias the model, the authors deduplicate the corpus using the method of Allamanis (2018). The resulting corpus contains 7.4 million files with a total of 9.3 billion tokens (16 million unique).
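The deduplication step above can be illustrated with a minimal sketch: compare files by the Jaccard similarity of their token sets and keep one representative per near-duplicate group. The tokenizer, the 0.8 threshold, and the greedy grouping below are simplifying assumptions for illustration, not the exact procedure of Allamanis (2018).

```python
# Illustrative near-duplicate file detection via Jaccard similarity
# over token sets. Tokenization and the threshold are assumptions,
# not the precise Allamanis (2018) method.
import re


def tokens(source: str) -> set:
    # Crude lexer: identifiers, integer literals, and single punctuation marks.
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source))


def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |A ∩ B| / |A ∪ B|.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(files: dict, threshold: float = 0.8) -> list:
    """Greedily keep one representative per group of near-duplicate files."""
    kept = []          # file names retained in the corpus
    kept_tokens = []   # token sets of the retained files
    for name, src in files.items():
        toks = tokens(src)
        # Keep the file only if it is not too similar to any retained file.
        if all(jaccard(toks, seen) < threshold for seen in kept_tokens):
            kept.append(name)
            kept_tokens.append(toks)
    return kept
```

For example, two files containing the same function body collapse to a single representative, while an unrelated file survives:

```python
corpus = {
    "a.py": "def add(x, y):\n    return x + y\n",
    "b.py": "def add(x, y):\n    return x + y\n",  # exact duplicate of a.py
    "c.py": "print('hello world')\n",
}
deduplicate(corpus)  # → ['a.py', 'c.py']
```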
Papers Using This Method
- Intraoperative perfusion assessment by continuous, low-latency hyperspectral light-field imaging: development, methodology, and clinical application (2025-04-15)
- SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation (2021-08-10)
- Learning and Evaluating Contextual Embedding of Source Code (2019-12-21)