How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Mostafa, Ahmed, Nahid, Raisul Arefin, Mulder, Samuel

arXiv.org Artificial Intelligence

Abstract--Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction--a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows. 
Tokenization is critical in transforming raw input data into structured representations, a process of central importance for Machine Learning (ML) and NLM tasks [1]-[3]. While tokenization strategies have been studied extensively for natural [4] and high-level programming languages [5], assembly code presents unique challenges due to its low-level operations, diverse instruction sets, and non-standardized syntax across architectures. These challenges highlight the need for specialized tokenization techniques that effectively capture assembly code's structural and semantic intricacies [2]. Despite its importance, the role of tokenization in assembly code processing remains underexplored, particularly in its impact on downstream tasks involving modern NLMs. Recent research underscores the significant influence of tokenization on NLM performance.
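The pre-tokenization rules the abstract refers to can be made concrete with a small, hypothetical example: a regex splitter that separates an assembly instruction into mnemonics, registers, immediates, and memory-operand punctuation before a subword model (such as BPE) is trained on the pieces. The token classes and register patterns below are illustrative assumptions, not the authors' actual configuration.

```python
import re

# Hypothetical pre-tokenization for x86-style assembly: split each
# instruction into registers, immediates, identifiers, and punctuation
# so a subword tokenizer (e.g. BPE) sees meaningful units.
TOKEN_RE = re.compile(
    r"(?P<reg>%?\b(?:r[a-z0-9]+|e[a-z]{2}|[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p)\b)"
    r"|(?P<imm>\$?-?0x[0-9a-fA-F]+|\$?-?\d+)"
    r"|(?P<word>[A-Za-z_.][A-Za-z0-9_.]*)"
    r"|(?P<punct>[\[\](),+*:])"
)

def pre_tokenize(instr: str) -> list[str]:
    """Return the ordered list of pre-tokens in one instruction."""
    return [m.group(0) for m in TOKEN_RE.finditer(instr)]

print(pre_tokenize("mov eax, [rbx+0x10]"))
```

Splitting memory operands into their bracket, register, and displacement parts keeps the vocabulary small while preserving addressing-mode structure, which is one of the trade-offs the intrinsic evaluations above measure.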



Enhancing Cluster Scheduling in HPC: A Continuous Transfer Learning for Real-Time Optimization

Sliwko, Leszek, Mizera-Pietraszko, Jolanta

arXiv.org Artificial Intelligence

This is the accepted version of the paper published in the 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Abstract -- This study presents a machine learning-assisted approach to optimize task scheduling in cluster systems, focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability, whereas the proposed continuous transfer learning model evolves dynamically during operations, minimizing retraining needs. Evaluated on Google Cluster Data, the model achieves over 99% accuracy, reducing computational overhead and improving scheduling latency for constrained tasks. This scalable solution enables real-time optimization, advancing machine learning integration in cluster management and paving the way for future adaptive scheduling strategies. In the rapidly evolving landscape of cloud computing and distributed high-performance environments, efficient management of hardware and software resources has become paramount for ensuring suitable performance and minimizing latency. As organizations increasingly rely on cluster-based architectures to orchestrate a broad range of applications, the importance of effective task scheduling has come to the forefront. Over the last few years, traditional schedulers such as Kubernetes have laid the groundwork for managing containerized workloads; however, they struggle to adapt to the dynamic nature of real-time workloads and node-affinity constraints [35]. These limitations result in inefficient resource utilization and longer scheduling delays, which ultimately affect overall system performance, especially in high-performance systems [9][18]. 
In mission-critical environments, these issues can escalate, disrupting vital systems like power networks, healthcare, defense systems, and others.
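The continuous-learning idea, updating the model after every observed scheduling outcome instead of retraining in batches, can be sketched with a toy online learner. The features, labels, and update rule below are illustrative assumptions, not the paper's actual model or feature set.

```python
# Minimal sketch of continuous (online) learning for a scheduling
# decision: predict whether a node satisfies a task's node-affinity
# and resource needs, updating weights after each placement outcome.
class OnlinePerceptron:
    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x) -> int:
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def update(self, x, y: int) -> None:
        """One-step update from the observed scheduling outcome y."""
        err = y - self.predict(x)
        if err:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err

# Illustrative features: [affinity label matches, CPU headroom, memory headroom]
model = OnlinePerceptron(n_features=3)
stream = [([1.0, 0.8, 0.7], 1), ([0.0, 0.9, 0.9], 0),
          ([1.0, 0.5, 0.6], 1), ([0.0, 0.2, 0.1], 0)] * 10
for x, y in stream:
    model.update(x, y)
print(model.predict([1.0, 0.6, 0.6]))
```

The point of the sketch is the control flow, not the model class: each scheduling event both uses and refines the predictor, which is what removes the periodic-retraining cost the abstract mentions.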


Basis Transformers for Multi-Task Tabular Regression

Loh, Wei Min, Shang, Jiaqi, Poupart, Pascal

arXiv.org Artificial Intelligence

Dealing with tabular data is challenging due to partial information, noise, and heterogeneous structure. Existing techniques often struggle to simultaneously address key aspects of tabular data such as textual information, a variable number of columns, and unseen data without metadata besides column names. We propose a novel architecture, \textit{basis transformers}, specifically designed to tackle these challenges while respecting inherent invariances in tabular data, including hierarchical structure and the representation of numeric values. We evaluate our design on a multi-task tabular regression benchmark, achieving an improvement of 0.338 in the median $R^2$ score and the lowest standard deviation across 34 tasks from the OpenML-CTR23 benchmark. Furthermore, our model has five times fewer parameters than the best-performing baseline and surpasses pretrained large language model baselines -- even when initialized from randomized weights.
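One of the invariances mentioned above, insensitivity to column order with a variable number of columns, can be sketched by encoding each (column name, value) pair independently and pooling with an order-agnostic reduction. The hash-based embedder and mean pooling below are illustrative stand-ins, not the paper's actual basis-transformer components.

```python
import hashlib

DIM = 8

def embed(text: str) -> list[float]:
    """Deterministic toy embedding: leading bytes of a hash, scaled to [0, 1)."""
    h = hashlib.sha256(text.encode()).digest()
    return [b / 256 for b in h[:DIM]]

def encode_row(row: dict) -> list[float]:
    """Permutation-invariant encoding of a tabular row: embed each
    (column, value) pair, then mean-pool across columns."""
    pair_vecs = [embed(f"{col}={val}") for col, val in row.items()]
    n = len(pair_vecs)
    return [sum(v[i] for v in pair_vecs) / n for i in range(DIM)]

a = encode_row({"age": 30, "city": "Oslo"})
b = encode_row({"city": "Oslo", "age": 30})  # same row, different column order
print(a == b)
```

Because the pooling ignores order and adapts to any number of pairs, the same encoder handles tables with different schemas and no metadata beyond column names.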


Context information can be more important than reasoning for time series forecasting with a large language model

Yang, Janghoon

arXiv.org Artificial Intelligence

With the evolution of large language models (LLMs), there is growing interest in leveraging LLMs for time series tasks. In this paper, we explore the characteristics of LLMs for time series forecasting by considering various existing and proposed prompting techniques. Forecasting for both short and long time series was evaluated. Our findings indicate that no single prompting method is universally applicable. It was also observed that simply providing proper context information related to the time series, without additional reasoning prompts, can achieve performance comparable to the best-performing prompt for each case. From this observation, it is expected that providing proper context information can be more crucial than a prompt for specific reasoning in time series forecasting. Several weaknesses in prompting for time series forecasting were also identified. First, LLMs often fail to follow the procedures described by the prompt. Second, when reasoning steps involve simple algebraic calculations with several operands, LLMs often fail to calculate accurately. Third, LLMs sometimes misunderstand the semantics of prompts, resulting in incomplete responses.
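The distinction drawn above, supplying context information versus prescribing explicit reasoning steps, can be made concrete with a toy prompt builder. The template wording is assumed for illustration and is not taken from the paper.

```python
def context_prompt(series, description: str, horizon: int) -> str:
    """Prompt that supplies only context information about the series."""
    values = ", ".join(str(v) for v in series)
    return (
        f"Context: {description}\n"
        f"Observed values: {values}\n"
        f"Predict the next {horizon} values. Answer with numbers only."
    )

def reasoning_prompt(series, horizon: int) -> str:
    """Prompt that instead asks for an explicit step-by-step procedure."""
    values = ", ".join(str(v) for v in series)
    return (
        f"Observed values: {values}\n"
        f"Step 1: estimate the trend. Step 2: estimate seasonality.\n"
        f"Step 3: combine both to predict the next {horizon} values."
    )

p = context_prompt([120, 135, 150], "monthly sales of a growing retail store", 2)
print(p)
```

The finding reported above is that the first style, given an accurate description, can match the best case-specific variant of the second, whose intermediate arithmetic the model may execute incorrectly anyway.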


Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

He, Haonan, Ren, Yuchen, Tang, Yining, Xu, Ziyang, Li, Junxian, Yang, Minghao, Zhang, Di, Yuan, Dong, Chen, Tao, Zhang, Shufei, Li, Yuqiang, Dong, Nanqing, Ouyang, Wanli, Zhou, Dongzhan, Ye, Peng

arXiv.org Artificial Intelligence

Large language models have already demonstrated their formidable capabilities in general domains, ushering in a revolutionary transformation. However, exploring and exploiting the extensive knowledge of these models to comprehend multi-omics biology remains underexplored. To fill this research gap, we first introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset including DNA, RNA, proteins, and multi-molecules, designed to bridge the gap between large language models (LLMs) and complex biological sequences-related tasks. This dataset can enhance the versatility of LLMs by integrating diverse biological sequence-based prediction tasks with advanced reasoning capabilities, while maintaining conversational fluency. Additionally, we reveal significant performance limitations in even state-of-the-art LLMs on biological sequence-related multi-omics tasks without specialized pre-training and instruction-tuning. We further develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline, demonstrating its powerful ability to understand biology using Biology-Instructions. Biology-Instructions and ChatMultiOmics are publicly available and crucial resources for enabling more effective integration of LLMs with multi-omics sequence analysis.


Characterizing Multimodal Long-form Summarization: A Case Study on Financial Reports

Cao, Tianyu, Raman, Natraj, Dervovic, Danial, Tan, Chenhao

arXiv.org Artificial Intelligence

As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports not only are long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Command. We find that GPT-3.5 and Command fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude has the ability to recognize important information. We also conduct a comprehensive investigation on the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4.
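The extractiveness analysis mentioned above can be approximated with a simple fragment-coverage measure: the fraction of summary tokens that fall inside contiguous token runs copied verbatim from the source, in the spirit of extractive coverage metrics; the exact metric used in the paper may differ.

```python
def extractive_coverage(source: str, summary: str) -> float:
    """Fraction of summary tokens covered by fragments copied verbatim
    (as contiguous token runs) from the source text."""
    src, summ = source.split(), summary.split()
    covered = i = 0
    while i < len(summ):
        best = 0  # longest source run matching the summary starting at i
        for j in range(len(src)):
            k = 0
            while i + k < len(summ) and j + k < len(src) and summ[i + k] == src[j + k]:
                k += 1
            best = max(best, k)
        if best:
            covered += best
            i += best
        else:
            i += 1  # token not found anywhere in the source
    return covered / len(summ) if summ else 0.0

src = "revenue grew 12 percent to 4.2 billion dollars in fiscal 2023"
print(extractive_coverage(src, "revenue grew 12 percent"))   # fully copied
print(extractive_coverage(src, "profit grew 12 percent"))    # one novel token
```

A score near 1.0 indicates a highly extractive summary; tracking it per source position is one way to expose the position bias described above.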


Hierarchical Delay Attribution Classification using Unstructured Text in Train Management Systems

Borg, Anton, Lingvall, Per, Svensson, Martin

arXiv.org Artificial Intelligence

EU directives stipulate a systematic follow-up of train delays. In Sweden, the Swedish Transport Administration registers and assigns an appropriate delay attribution code. However, this delay attribution code is assigned manually, which is a complex task. In this paper, a machine learning-based decision support for assigning delay attribution codes based on event descriptions is investigated. The text is transformed using TF-IDF, and two models, Random Forest and Support Vector Machine, are evaluated against a random uniform classifier and the classification performance of the Swedish Transport Administration. Further, the problem is modeled as both a hierarchical and flat approach. The results indicate that a hierarchical approach performs better than a flat approach. Both approaches perform better than the random uniform classifier but perform worse than the manual classification.
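The TF-IDF pipeline described above can be sketched in miniature: weight tokenized event descriptions, then assign a (group, code) pair by similarity. The event texts, labels, and nearest-neighbour decision rule below are invented for illustration; the paper evaluates Random Forest and SVM classifiers instead.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weighting over tokenized event descriptions."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented event descriptions with (top-level group, delay code) labels.
docs = ["signal failure at junction".split(),
        "signal box power outage".split(),
        "track maintenance overrun".split(),
        "track obstruction reported".split()]
labels = [("infrastructure", "signal"), ("infrastructure", "signal"),
          ("engineering", "track"), ("engineering", "track")]

query_tokens = "power failure at signal box".split()
vecs = tfidf(docs + [query_tokens])
query, vecs = vecs[-1], vecs[:-1]

# Flat nearest neighbour returning the full (group, code) pair; a
# hierarchical variant would pick the group first, then search only
# the codes within that group.
best = max(range(len(docs)), key=lambda i: cosine(vecs[i], query))
print(labels[best])
```

The hierarchical variant narrows each decision to a smaller label set, which is consistent with the finding above that it outperforms the flat approach.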


Representation Learning on Hyper-Relational and Numeric Knowledge Graphs with Transformers

Chung, Chanyoung, Lee, Jaejun, Whang, Joyce Jiyoung

arXiv.org Artificial Intelligence

A hyper-relational knowledge graph has been recently studied where a triplet is associated with a set of qualifiers; a qualifier is composed of a relation and an entity, providing auxiliary information for a triplet. While existing hyper-relational knowledge graph embedding methods assume that the entities are discrete objects, some information should be represented using numeric values, e.g., (J.R.R., was born in, 1892). Also, a triplet (J.R.R., educated at, Oxford Univ.) can be associated with a qualifier such as (start time, 1911). In this paper, we propose a unified framework named HyNT that learns representations of a hyper-relational knowledge graph containing numeric literals in either triplets or qualifiers. We define a context transformer and a prediction transformer to learn the representations based not only on the correlations between a triplet and its qualifiers but also on the numeric information. By learning compact representations of triplets and qualifiers and feeding them into the transformers, we reduce the computation cost of using transformers. Using HyNT, we can predict missing numeric values in addition to missing entities or relations in a hyper-relational knowledge graph. Experimental results show that HyNT significantly outperforms state-of-the-art methods on real-world datasets.
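The data model described above, triplets carrying qualifier sets where either part may hold a numeric literal, maps naturally onto a small set of types. The classes below are a hypothetical sketch of the input representation, not HyNT's actual code.

```python
from dataclasses import dataclass
from typing import Union

Value = Union[str, float]  # an entity name or a numeric literal

@dataclass(frozen=True)
class Qualifier:
    relation: str
    value: Value  # e.g. ("start time", 1911.0)

@dataclass
class HyperTriplet:
    head: str
    relation: str
    tail: Value  # numeric literals are allowed in the triplet itself
    qualifiers: tuple = ()

born = HyperTriplet("J.R.R.", "was born in", 1892.0)
edu = HyperTriplet("J.R.R.", "educated at", "Oxford Univ.",
                   qualifiers=(Qualifier("start time", 1911.0),))

def numeric_slots(t: HyperTriplet) -> list[float]:
    """Collect numeric literals from the triplet and its qualifiers --
    the kind of values a model like HyNT can be asked to predict."""
    vals = [t.tail] if isinstance(t.tail, float) else []
    vals += [q.value for q in t.qualifiers if isinstance(q.value, float)]
    return vals

print(numeric_slots(born), numeric_slots(edu))
```

Masking any element of such a structure, an entity, a relation, or a numeric slot, yields the three prediction tasks the experiments above evaluate.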