Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

Wang, Deze, Jia, Zhouyang, Li, Shanshan, Yu, Yue, Xiong, Yun, Dong, Wei, Liao, Xiangke

Dec-4-2021–arXiv.org Artificial Intelligence

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Dec-4-2021

arXiv.org PDF

Add feedback

Country:
- Asia > China (0.29)
- North America > United States (0.30)

Genre:
- Research Report
  - New Finding (0.48)
  - Promising Solution (0.34)

Industry:
- Education (0.67)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
    - Natural Language (1.00)
    - Representation & Reasoning (0.93)
  - Sensing and Signal Processing > Image Processing (1.00)
  - Software Engineering (1.00)