basic block
Esim: EVM Bytecode Similarity Detection Based on Stable-Semantic Graph
Chen, Zhuo, Ji, Gaoqiang, He, Yiling, Wu, Lei, Zhou, Yajin
Decentralized finance (DeFi) is experiencing rapid expansion. However, prevalent code reuse and limited open-source contributions have introduced significant challenges to the blockchain ecosystem, including plagiarism and the propagation of vulnerable code. Consequently, an effective and accurate similarity detection method for EVM bytecode is urgently needed to identify similar contracts. Traditional binary similarity detection methods are typically based on the instruction stream or the control flow graph (CFG), both of which have limitations on EVM bytecode due to specific features like low-level EVM bytecode and heavily reused basic blocks. Moreover, the highly diverse Solidity Compiler (Solc) versions further complicate accurate similarity detection. Motivated by these challenges, we propose a novel EVM bytecode representation called Stable-Semantic Graph (SSG), which captures relationships between "stable instructions" (special instructions identified by our study). Moreover, we implement a prototype, Esim, which embeds SSG into matrices for similarity detection using a heterogeneous graph neural network. Esim demonstrates high accuracy in SSG construction, achieving F1-scores of 100% for control flow and 95.16% for data flow, and its similarity detection performance reaches 96.3% AUC, surpassing traditional approaches. Our large-scale study, analyzing 2,675,573 smart contracts on six EVM-compatible chains over a one-year period, also demonstrates that Esim outperforms the SOTA tool Etherscan in vulnerability search. With the rapid expansion of decentralized finance (DeFi) in the blockchain ecosystem, DeFi projects, which are built on smart contracts on the Ethereum Virtual Machine (EVM), have attracted substantial investment in recent years, with over $88.8 billion Total Value Locked (TVL) in 2024 [1]. As a representative case, the Compound v2 protocol [3], one of the top lending protocols, has been widely adopted and forked by numerous DeFi projects.
This protocol has a known precision loss issue that can be exploited when the corresponding market lacks liquidity. Since 2022, a series of attacks (e.g., Hundred Finance Attack [4], Onyx Protocol Attack [5], Radiant Attack [6]) have been observed due to the code abuse of the Compound v2 protocol, resulting in millions of dollars in losses. Consequently, there is an urgent need for an efficient method to detect code reuse in EVM bytecode (binaries), a process also known as EVM bytecode similarity detection. More than 99% of Ethereum contracts are not open source [2]. In general, binary similarity detection studies in traditional languages (e.g., C++ [7], [8], [9] and Java [10]) can be divided into two categories, i.e., instruction stream based and control flow graph (CFG) based.
- Information Technology > Security & Privacy (1.00)
- Banking & Finance > Trading (1.00)
Supplementary Material for Paper "Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs" A. Criteria for Node Equality When Merging Traces
[...] TraceGraph, it compares the type, attributes, and the executed location of each operation. For example, the MatMul operation of TensorFlow has 'MatMul' as [...] GraphGenerator fails to match because of the different attributes. The pushed call id is popped when the function returns. As with the call id stack, Terra manages the loop id stack for the entire program execution. The current implementation of Terra does not yet consider multi-threading.
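The equality criteria described above (operation type, attributes, and executed location, with the location tracked through a call id stack) can be illustrated with a minimal Python sketch. This is not Terra's implementation: the names `TraceNode`, `CallIdStack`, `make_node`, and `nodes_equal`, the choice to represent a location as a snapshot of the call id stack, and the `transpose_a` attribute used in the usage note are all assumptions made for illustration, and the loop id stack is omitted.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class TraceNode:
    """Hypothetical trace node: one executed operation."""
    op_type: str                          # e.g. "MatMul"
    attrs: Tuple[Tuple[str, Any], ...]    # sorted (name, value) pairs
    location: Tuple[int, ...]             # snapshot of the call id stack


class CallIdStack:
    """Hypothetical call id stack: push on function entry, pop on return."""

    def __init__(self) -> None:
        self._stack: List[int] = []

    def push(self, call_id: int) -> None:
        self._stack.append(call_id)

    def pop(self) -> int:
        return self._stack.pop()

    def snapshot(self) -> Tuple[int, ...]:
        return tuple(self._stack)


def make_node(op_type: str, attrs: Dict[str, Any], calls: CallIdStack) -> TraceNode:
    """Record an operation together with the location it executed at."""
    return TraceNode(op_type, tuple(sorted(attrs.items())), calls.snapshot())


def nodes_equal(a: TraceNode, b: TraceNode) -> bool:
    """Two nodes match only if type, attributes, and location all agree."""
    return (
        a.op_type == b.op_type
        and a.attrs == b.attrs
        and a.location == b.location
    )
```

Under these assumptions, two `MatMul` nodes recorded at the same call depth with identical attributes compare equal, while changing an attribute (for example `transpose_a`) or recording the node after a `pop()` makes the comparison fail.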
Explainable Attention-Guided Stacked Graph Neural Networks for Malware Detection
Shokouhinejad, Hossein, Razavi-Far, Roozbeh, Higgins, Griffin, Ghorbani, Ali A
Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive techniques. Graph neural networks (GNNs) have shown promise in this domain by modeling rich structural dependencies in graph-based program representations such as control flow graphs (CFGs). However, single-model approaches may suffer from limited generalization and lack interpretability, especially in high-stakes security applications. In this paper, we propose a novel stacking ensemble framework for graph-based malware detection and explanation. Our method dynamically extracts CFGs from portable executable (PE) files and encodes their basic blocks through a two-step embedding strategy. A set of diverse GNN base learners, each with a distinct message-passing mechanism, is used to capture complementary behavioral features. Their prediction outputs are aggregated by a meta-learner implemented as an attention-based multilayer perceptron, which both classifies malware instances and quantifies the contribution of each base model. To enhance explainability, we introduce an ensemble-aware post-hoc explanation technique that leverages edge-level importance scores generated by a GNN explainer and fuses them using the learned attention weights. This produces interpretable, model-agnostic explanations aligned with the final ensemble decision. Experimental results demonstrate that our framework improves classification performance while providing insightful interpretations of malware behavior.
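As a rough illustration of the fusion step described above, the sketch below combines per-model edge importance scores using attention weights. It is a sketch under stated assumptions, not the authors' method: the function names, the dictionary representation of edge scores, and the use of a softmax to normalize the attention values are all hypothetical choices, since the abstract only says the scores are fused using the learned attention weights.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source block, target block); hypothetical encoding


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; assumed normalization of attention values."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_edge_importance(
    per_model_scores: List[Dict[Edge, float]],
    attention_logits: List[float],
) -> Dict[Edge, float]:
    """Weighted sum of each base model's edge scores by its attention weight."""
    if len(per_model_scores) != len(attention_logits):
        raise ValueError("one attention value is required per base model")
    weights = softmax(attention_logits)
    fused: Dict[Edge, float] = {}
    for weight, scores in zip(weights, per_model_scores):
        for edge, score in scores.items():
            # An edge a model did not score contributes nothing from that model.
            fused[edge] = fused.get(edge, 0.0) + weight * score
    return fused
```

With equal attention values every base model contributes equally, so an edge scored 1.0 by one of two models ends up with a fused importance of 0.5.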
- North America > Canada > New Brunswick > York County > Fredericton (0.04)
- North America > Canada > New Brunswick > Fredericton (0.04)
ZKTorch: Compiling ML Inference to Zero-Knowledge Proofs via Parallel Proof Accumulation
Chen, Bing-Jyue, Tang, Lilia, Kang, Daniel
As AI models become ubiquitous in our daily lives, there has been an increasing demand for transparency in ML services. However, the model owner does not want to reveal the weights, as they are considered trade secrets. To solve this problem, researchers have turned to zero-knowledge proofs of ML model inference. These proofs convince the user that the ML model output is correct, without revealing the weights of the model to the user. Past work on these provers can be placed into two categories. The first method compiles the ML model into a low-level circuit, and proves the circuit using a ZK-SNARK. The second method uses custom cryptographic protocols designed only for a specific class of models. Unfortunately, the first method is highly inefficient, making it impractical for the large models used today, and the second method does not generalize well, making it difficult to update in the rapidly changing field of machine learning. To solve this, we propose ZKTorch, an open source end-to-end proving system that compiles ML models into base cryptographic operations called basic blocks, each proved using specialized protocols. ZKTorch is built on top of a novel parallel extension to the Mira accumulation scheme, enabling succinct proofs with minimal accumulation overhead. These contributions allow ZKTorch to achieve at least a 3× reduction in the proof size compared to specialized protocols and up to a 6× speedup in proving time over a general-purpose ZKML framework.
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > Illinois (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (2 more...)
LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability
Zhang, Shuoming, Zhao, Jiacheng, Xia, Chunwei, Wang, Zheng, Chen, Yunji, Feng, Xiaobing, Cui, Huimin
Large language models (LLMs) have the potential to revolutionize how we design and implement compilers and code translation tools. However, existing LLMs struggle to handle long and complex programs. We introduce LEGO-Compiler, a novel neural compilation system that leverages LLMs to translate high-level languages into assembly code. Our approach centers on three key innovations: LEGO translation, which decomposes the input program into manageable blocks; a verifiable LLM workflow, which breaks the complex compilation process into smaller, simpler steps checked by external tests; and a feedback mechanism for self-correction. Supported by formal proofs of translation composability, LEGO-Compiler demonstrates high accuracy on multiple datasets, including over 99% on ExeBench and 97.9% on industrial-grade AnsiBench. Additionally, LEGO-Compiler achieves a near one order-of-magnitude improvement in compilable code size scalability. This work opens new avenues for applying LLMs to system-level tasks, complementing traditional compiler technologies.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Beijing > Beijing (0.04)
- (12 more...)
- Research Report (1.00)
- Workflow (0.68)
PrETi: Predicting Execution Time in Early Stage with LLVM and Machine Learning
Xu, Risheng, Sieweck, Philipp, von Hasseln, Hermann, Nowotka, Dirk
We introduce PrETi, a novel framework for predicting software execution time during the early stages of development. PrETi leverages an LLVM-based simulation environment to extract timing-related runtime information, such as the count of executed LLVM IR instructions. This information, combined with historical execution time data, is utilized to train machine learning models for accurate time prediction. To further enhance prediction accuracy, our approach incorporates simulations of cache accesses and branch prediction. The evaluations on public benchmarks demonstrate that PrETi achieves an average Absolute Percentage Error (APE) of 11.98%, surpassing state-of-the-art methods. These results underscore the effectiveness and efficiency of PrETi as a robust solution for early-stage timing analysis.
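For readers unfamiliar with the metric quoted above, the following sketch computes the Absolute Percentage Error and its average, assuming the standard definition APE = |predicted - actual| / |actual| x 100. The abstract does not spell out the formula, so this definition, the function names, and the sample timings in the usage note are assumptions.

```python
from typing import Iterable, Tuple


def absolute_percentage_error(predicted: float, actual: float) -> float:
    """APE in percent, under the standard definition (an assumption here)."""
    if actual == 0:
        raise ValueError("APE is undefined when the actual value is zero")
    return 100.0 * abs(predicted - actual) / abs(actual)


def average_ape(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean APE over (predicted, actual) execution-time pairs."""
    errors = [absolute_percentage_error(p, a) for p, a in pairs]
    if not errors:
        raise ValueError("at least one (predicted, actual) pair is required")
    return sum(errors) / len(errors)
```

For example, a predicted time of 110 ms against a measured 100 ms gives an APE of 10%, and averaging that with a 90 ms prediction against a measured 120 ms (25%) gives 17.5%.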
- Europe > Germany > Schleswig-Holstein > Kiel (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.34)
Towards LLM-based optimization compilers. Can LLMs learn how to apply a single peephole optimization? Reasoning is all LLMs need!
Large Language Models (LLMs) have demonstrated great potential in various language processing tasks, and recent studies have explored their application in compiler optimizations. However, all these studies focus on the conventional open-source LLMs, such as Llama2, which lack enhanced reasoning mechanisms. In this study, we investigate the errors produced by the fine-tuned 7B-parameter Llama2 model as it attempts to learn and apply a simple peephole optimization for the AArch64 assembly code. We provide an analysis of the errors produced by the LLM and compare it with state-of-the-art OpenAI models which implement advanced reasoning logic, including GPT-4o and GPT-o1 (preview). We demonstrate that OpenAI GPT-o1, despite not being fine-tuned, outperforms the fine-tuned Llama2 and GPT-4o. Our findings indicate that this advantage is largely due to the chain-of-thought reasoning implemented in GPT-o1. We hope our work will inspire further research on using LLMs with enhanced reasoning mechanisms and chain-of-thought for code generation and optimization.
- Europe > United Kingdom > England > Greater London > London (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
FuzzDistill: Intelligent Fuzzing Target Selection using Compile-Time Analysis and Machine Learning
Fuzz testing is a fundamental technique employed to identify vulnerabilities within software systems. However, the process can be protracted and resource-intensive, especially when confronted with extensive codebases. In this work, I present FuzzDistill, an approach that harnesses compile-time data and machine learning to refine fuzzing targets. By analyzing compile-time information, such as function call graph features, loop information, and memory operations, FuzzDistill identifies high-priority areas of the codebase that are more likely to contain vulnerabilities. I demonstrate the efficacy of my approach through experiments conducted on real-world software, which show substantial reductions in testing time. Fuzz testing is a critical technique for identifying vulnerabilities in software by subjecting programs to random or semi-random inputs. As a result, large portions of the code are left unexplored, and significant vulnerabilities can go undetected.
- North America > United States > Virginia (0.05)
- Asia > Singapore > Central Region > Singapore (0.04)
- Information Technology > Security & Privacy (0.47)
- Government > Military (0.41)
Deep Learning-Based Channel Squeeze U-Structure for Lung Nodule Detection and Segmentation
Sui, Mingxiu, Hu, Jiacheng, Zhou, Tong, Liu, Zibo, Wen, Likang, Du, Junliang
This paper introduces a novel deep-learning method for the automatic detection and segmentation of lung nodules, aimed at advancing the accuracy of early-stage lung cancer diagnosis. The proposed approach leverages a unique "Channel Squeeze U-Structure" that optimizes feature extraction and information integration across multiple semantic levels of the network. This architecture includes three key modules: shallow information processing, channel residual structure, and channel squeeze integration. These modules enhance the model's ability to detect and segment small, imperceptible, or ground-glass nodules, which are critical for early diagnosis. The method demonstrates superior performance in terms of sensitivity, Dice similarity coefficient, precision, and mean Intersection over Union (IoU). Extensive experiments were conducted on the Lung Image Database Consortium (LIDC) dataset using five-fold cross-validation, showing excellent stability and robustness. The results indicate that this approach holds significant potential for improving computer-aided diagnosis systems, providing reliable support for radiologists in clinical practice and aiding in the early detection of lung cancer, especially in resource-limited settings.
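Two of the segmentation metrics named above can be sketched for binary masks as follows, assuming the standard definitions Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. The function name, the flat 0/1 list representation of masks, and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Dice coefficient and IoU for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou
```

For a predicted mask `[1, 1, 0, 0]` and a ground-truth mask `[1, 0, 1, 0]`, one pixel overlaps out of three in the union, giving a Dice coefficient of 0.5 and an IoU of 1/3.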
- North America > United States > Iowa (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)