How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder
–arXiv.org Artificial Intelligence
Abstract--Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction--a critical problem in binary code analysis. To this end, we conduct a thorough study of various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.

Tokenization is critical in transforming raw input data into structured representations, a process of utmost importance for Machine Learning (ML) and NLM tasks [1]-[3]. While tokenization strategies have been studied extensively for natural [4] and high-level programming languages [5], assembly code presents unique challenges due to its low-level operations, diverse instruction sets, and non-standardized syntax across architectures. These challenges highlight the need for specialized tokenization techniques that effectively capture assembly code's structural and semantic intricacies [2]. Despite its importance, the role of tokenization in assembly code processing remains underexplored, particularly in its impact on downstream tasks involving modern NLMs. Recent research underscores the significant influence of tokenization on NLM performance.
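To make the setup concrete, the following is a minimal sketch, not the authors' exact pipeline, of how one might train a BPE tokenizer with assembly-aware pre-tokenization rules and compute two of the intrinsic metrics named above (tokenization efficiency and vocabulary compression) using the Hugging Face `tokenizers` library. The toy corpus, pre-tokenization rules, and vocabulary sizes here are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch: train a BPE tokenizer on assembly text with custom pre-tokenization,
# then report simple intrinsic metrics. Corpus and vocab sizes are illustrative.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Toy assembly corpus; the study would use disassembly from real binaries.
corpus = [
    "mov eax, dword ptr [rbp-0x8]",
    "add eax, 0x1",
    "mov dword ptr [rbp-0x8], eax",
    "call 0x401000",
    "ret",
]

def train_asm_bpe(vocab_size: int) -> Tokenizer:
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    # Pre-tokenization rule tailored to assembly: split on whitespace and
    # punctuation so mnemonics, registers, and immediates remain separable.
    tok.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Whitespace(),
        pre_tokenizers.Punctuation(),
    ])
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer)
    return tok

for vocab_size in (64, 128):  # illustrative sweep over the vocab-size parameter
    tok = train_asm_bpe(vocab_size)
    encodings = [tok.encode(line) for line in corpus]
    total_tokens = sum(len(e.ids) for e in encodings)
    total_chars = sum(len(line) for line in corpus)
    # Tokenization efficiency: fewer tokens per instruction is better.
    tokens_per_instruction = total_tokens / len(corpus)
    # Vocabulary compression: characters covered per emitted token.
    chars_per_token = total_chars / total_tokens
    print(f"vocab={vocab_size}: {tokens_per_instruction:.2f} tokens/instr, "
          f"{chars_per_token:.2f} chars/token")
```

Swapping `models.BPE` for a WordPiece or Unigram model under the same pre-tokenization rules is one way to compare tokenization algorithms head-to-head before any downstream fine-tuning.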
Nov-7-2025