CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model

Xin Wang, Yasheng Wang, Pingyi Zhou, Fei Mi, Meng Xiao, Yadao Wang, Li Li, Xiao Liu, Hao Wu, Jin Liu, Xin Jiang

arXiv.org Artificial Intelligence 

Code pre-trained models have shown great success in various code-related tasks, such as code search, code clone detection, and code translation. However, most existing code pre-trained models treat a code snippet as a plain sequence of tokens, ignoring the inherent syntax and hierarchy that provide important structural and semantic information; the sequence representations derived from plain tokens alone are therefore insufficient. To this end, we propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model, to deal with various code intelligence tasks. In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST) and leverage Contrastive Learning (CL) to learn noise-invariant code representations. Besides the original masked language model (MLM) objective, we also introduce two novel pre-training objectives: (1) ``AST Node Edge Prediction (NEP)'' to predict edges between nodes in the abstract syntax tree; and (2) ``Code Token Type Prediction (TTP)'' to predict the types of code tokens. Extensive experiments on four code intelligence tasks demonstrate the superior performance of CLSEBERT compared to state-of-the-art models trained on the same pre-training corpus and at the same parameter scale.
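The abstract describes four pre-training signals (MLM, NEP, TTP, and a contrastive term). The sketch below is not the authors' code; it is a minimal PyTorch illustration, under assumed names (encoder, heads, batch keys such as node_pairs and type_labels), of how such objectives could be combined into a single training loss.

import torch
import torch.nn.functional as F

def pretraining_loss(encoder, heads, batch, temperature=0.1):
    """Hypothetical combination of MLM, AST Node Edge Prediction (NEP),
    Code Token Type Prediction (TTP), and a contrastive (CL) term."""
    # Two stochastic "views" of the same code+AST input (e.g. different
    # dropout masks) to form a positive pair for the contrastive objective.
    h1 = encoder(batch["input_ids"], batch["attention_mask"])  # (B, T, D)
    h2 = encoder(batch["input_ids"], batch["attention_mask"])  # (B, T, D)

    # 1) Masked language modeling over masked code/AST tokens.
    mlm_logits = heads["mlm"](h1)                               # (B, T, V)
    loss_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        batch["mlm_labels"].view(-1), ignore_index=-100)

    # 2) NEP: binary prediction of whether an edge exists between two AST
    #    nodes, scored from the concatenation of their representations.
    i, j = batch["node_pairs"][:, 0], batch["node_pairs"][:, 1]
    pair_repr = torch.cat([h1[batch["pair_batch"], i],
                           h1[batch["pair_batch"], j]], dim=-1)  # (P, 2D)
    loss_nep = F.binary_cross_entropy_with_logits(
        heads["edge"](pair_repr).squeeze(-1), batch["edge_labels"].float())

    # 3) TTP: classify the syntactic type of each code token.
    type_logits = heads["type"](h1)                              # (B, T, C)
    loss_ttp = F.cross_entropy(
        type_logits.view(-1, type_logits.size(-1)),
        batch["type_labels"].view(-1), ignore_index=-100)

    # 4) CL: pull the two views of the same snippet together (NT-Xent style),
    #    using the first position as a sequence-level representation.
    z1 = F.normalize(h1[:, 0], dim=-1)
    z2 = F.normalize(h2[:, 0], dim=-1)
    logits = z1 @ z2.t() / temperature                           # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_cl = F.cross_entropy(logits, targets)

    return loss_mlm + loss_nep + loss_ttp + loss_cl

The equal weighting of the four terms here is an assumption for illustration; the paper may weight or schedule the objectives differently.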