DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

Neural Information Processing Systems

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
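The obfuscation step that DOBF's objective inverts can be illustrated with a minimal sketch: rename every function name and local identifier in a snippet to an uninformative placeholder (FUNC_0, VAR_1, ...), and record the mapping that a pre-trained model would be asked to recover. This is not the paper's actual pipeline; it is a toy Python-only version built on the standard `ast` module, and the example source snippet is invented for illustration.

```python
import ast

class Obfuscator(ast.NodeTransformer):
    """Rename function names, arguments, and local variables to
    uninformative placeholders. Recovering `self.mapping` from the
    obfuscated output is, in essence, the deobfuscation task."""

    def __init__(self):
        self.mapping = {}  # original identifier -> placeholder

    def _rename(self, name, prefix):
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, "FUNC")
        self.generic_visit(node)  # visits args before the body
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        # Rename identifiers being assigned, or ones already seen;
        # builtins such as `len` are left untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._rename(node.id, "VAR")
        return node

# Hypothetical input snippet (not from the paper's datasets).
src = """
def get_env_vars(env):
    result = []
    for key in env:
        result.append(key)
    return result
"""

obf = Obfuscator()
obfuscated = ast.unparse(obf.visit(ast.parse(src)))
print(obfuscated)
print(obf.mapping)
```

Running this yields a function whose tokens carry no naming signal (`def FUNC_0(VAR_1): ...`), while `obf.mapping` holds the ground-truth identifiers the model must predict. `ast.unparse` requires Python 3.9 or later.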


Table 3: Dataset statistics.

                                Java     Python
  Size                          26 GB    19 GB
  Nb files                      7.9M     3.6M
  Av. nb of tokens / file       718      1245
  Av. nb of identifiers / file  25.9     41.8

[Table of examples: Input Code | Proposed Function Name]


DOBF finds relevant names for Java methods without copying any of the surrounding tokens. For instance, in the first example, it understands that the function is used to get environment variables. Although the suggested names l and a are not very informative, they indicate that the variable is a list or an array. The last three functions shown perform graph traversals.
