DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
– Neural Information Processing Systems
Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we find that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
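As a rough illustration of the deobfuscation objective described in the abstract, the sketch below replaces function and variable names in a Python snippet with uninformative placeholder tokens and returns the mapping a DOBF-style model would be trained to recover. The placeholder format (`FUNC_0`, `VAR_0`) and the use of Python's `ast` module are illustrative assumptions, not the paper's released preprocessing code.

```python
# Minimal sketch of identifier obfuscation (requires Python 3.9+ for ast.unparse).
import ast


def obfuscate(source: str):
    """Replace function and variable names with placeholder tokens.

    Returns the obfuscated code and the name mapping the model must recover.
    """
    mapping, counters = {}, {"FUNC": 0, "VAR": 0}

    def rename(name: str, kind: str) -> str:
        # Reuse an existing placeholder if the identifier was seen before.
        if name not in mapping:
            mapping[name] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        return mapping[name]

    class Renamer(ast.NodeTransformer):
        def visit_FunctionDef(self, node):
            node.name = rename(node.name, "FUNC")
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = rename(node.arg, "VAR")
            return node

        def visit_Name(self, node):
            node.id = rename(node.id, "VAR")
            return node

    obfuscated = ast.unparse(Renamer().visit(ast.parse(source)))
    return obfuscated, mapping


code = (
    "def fibonacci(n):\n"
    "    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"
)
obf, names = obfuscate(code)
print(obf)    # def FUNC_0(VAR_0): return VAR_0 if VAR_0 < 2 else FUNC_0(...) ...
print(names)  # {'fibonacci': 'FUNC_0', 'n': 'VAR_0'}
```

In this framing, the pre-training task is a sequence-to-sequence problem: given the obfuscated code, predict the dictionary of original identifiers, which forces the model to understand what each function and variable actually does.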