CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model

Xin Wang, Yasheng Wang, Pingyi Zhou, Fei Mi, Meng Xiao, Yadao Wang, Li Li, Xiao Liu, Hao Wu, Jin Liu, Xin Jiang

arXiv.org Artificial Intelligence 

Code pre-trained models have shown great success in various code-related tasks, such as code search, code clone detection, and code translation. However, most existing code pre-trained models treat a code snippet as a plain sequence of tokens, ignoring the inherent syntax and hierarchy that provide important structural and semantic information; the sequence representations derived from plain tokens alone are therefore insufficient. To this end, we propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model, to deal with various code intelligence tasks. In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST) and leverage Contrastive Learning (CL) to learn noise-invariant code representations. Besides the original masked language model (MLM) objective, we also introduce two novel pre-training objectives: (1) ``AST Node Edge Prediction (NEP)'' to predict edges between nodes in the abstract syntax tree; and (2) ``Code Token Type Prediction (TTP)'' to predict the types of code tokens. Extensive experiments on four code intelligence tasks demonstrate the superior performance of CLSEBERT compared to state-of-the-art models trained on the same pre-training corpus and at the same parameter scale.
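The abstract describes four pre-training signals (MLM, NEP, TTP, and a contrastive term). The sketch below is not the authors' code; it is a minimal PyTorch illustration, under assumed names (encoder, heads, batch keys such as node_pairs and type_labels), of how such objectives could be combined into a single training loss.

import torch
import torch.nn.functional as F

def pretraining_loss(encoder, heads, batch, temperature=0.1):
    """Hypothetical combination of MLM, AST Node Edge Prediction (NEP),
    Code Token Type Prediction (TTP), and a contrastive (CL) term."""
    # Two stochastic "views" of the same code+AST input (e.g. different
    # dropout masks) to form a positive pair for the contrastive objective.
    h1 = encoder(batch["input_ids"], batch["attention_mask"])  # (B, T, D)
    h2 = encoder(batch["input_ids"], batch["attention_mask"])  # (B, T, D)

    # 1) Masked language modeling over masked code/AST tokens.
    mlm_logits = heads["mlm"](h1)                               # (B, T, V)
    loss_mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        batch["mlm_labels"].view(-1), ignore_index=-100)

    # 2) NEP: binary prediction of whether an edge exists between two AST
    #    nodes, scored from the concatenation of their representations.
    i, j = batch["node_pairs"][:, 0], batch["node_pairs"][:, 1]
    pair_repr = torch.cat([h1[batch["pair_batch"], i],
                           h1[batch["pair_batch"], j]], dim=-1)  # (P, 2D)
    loss_nep = F.binary_cross_entropy_with_logits(
        heads["edge"](pair_repr).squeeze(-1), batch["edge_labels"].float())

    # 3) TTP: classify the syntactic type of each code token.
    type_logits = heads["type"](h1)                              # (B, T, C)
    loss_ttp = F.cross_entropy(
        type_logits.view(-1, type_logits.size(-1)),
        batch["type_labels"].view(-1), ignore_index=-100)

    # 4) CL: pull the two views of the same snippet together (NT-Xent style),
    #    using the first position as a sequence-level representation.
    z1 = F.normalize(h1[:, 0], dim=-1)
    z2 = F.normalize(h2[:, 0], dim=-1)
    logits = z1 @ z2.t() / temperature                           # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)
    loss_cl = F.cross_entropy(logits, targets)

    return loss_mlm + loss_nep + loss_ttp + loss_cl

The equal weighting of the four terms here is an assumption for illustration; the paper may weight or schedule the objectives differently.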