Esim: EVM Bytecode Similarity Detection Based on Stable-Semantic Graph
Chen, Zhuo, Ji, Gaoqiang, He, Yiling, Wu, Lei, Zhou, Yajin
arXiv.org Artificial Intelligence
Abstract: Decentralized finance (DeFi) is experiencing rapid expansion. However, prevalent code reuse and limited open-source contributions have introduced significant challenges to the blockchain ecosystem, including plagiarism and the propagation of vulnerable code. Consequently, an effective and accurate similarity detection method for EVM bytecode is urgently needed to identify similar contracts. Traditional binary similarity detection methods are typically based on the instruction stream or the control flow graph (CFG), both of which have limitations on EVM bytecode due to specific features such as its low-level nature and heavily reused basic blocks. Moreover, the highly diverse Solidity Compiler (Solc) versions further complicate accurate similarity detection. Motivated by these challenges, we propose a novel EVM bytecode representation called Stable-Semantic Graph (SSG), which captures relationships between "stable instructions" (special instructions identified by our study). Moreover, we implement a prototype, Esim, which embeds SSG into matrices for similarity detection using a heterogeneous graph neural network. Esim demonstrates high accuracy in SSG construction, achieving F1-scores of 100% for control flow and 95.16% for data flow, and its similarity detection performance reaches 96.3% AUC, surpassing traditional approaches. Our large-scale study, analyzing 2,675,573 smart contracts on six EVM-compatible chains over a one-year period, also demonstrates that Esim outperforms the SOTA tool Etherscan in vulnerability search.

With the rapid expansion of decentralized finance (DeFi) in the blockchain ecosystem, DeFi projects, which are built on smart contracts running on the Ethereum Virtual Machine (EVM), have attracted substantial investment in recent years, with over $88.8 billion in Total Value Locked (TVL) in 2024 [1]. As a representative case, the Compound v2 protocol [3], one of the top lending protocols, has been widely adopted and forked by numerous DeFi projects.
This protocol has a known precision loss issue that can be exploited when the corresponding market lacks liquidity. Since 2022, a series of attacks (e.g., Hundred Finance Attack [4], Onyx Protocol Attack [5], Radiant Attack [6]) have been observed due to the code abuse of the Compound v2 protocol, resulting in millions of dollars in losses. Consequently, there is an urgent need for an efficient method to detect code reuse in EVM bytecode (binaries), a process also known as EVM bytecode similarity detection. More than 99% of Ethereum contracts are not open source [2].

In general, binary similarity detection studies in traditional languages (e.g., C++ [7], [8], [9] and Java [10]) can be divided into two categories: instruction stream based and control flow graph (CFG) based.
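To make the instruction-stream category concrete, the following is a minimal sketch of one possible baseline: it extracts the opcode sequence from EVM bytecode (skipping PUSH immediates) and compares two contracts by the Jaccard similarity of their opcode n-grams. This is not Esim's method and is not taken from the paper; the function names `opcode_stream`, `ngrams`, and `jaccard_similarity`, the choice of n-grams, and the Jaccard metric are all illustrative assumptions. The only EVM fact relied on is that opcodes 0x60 to 0x7f (PUSH1 to PUSH32) are followed by 1 to 32 bytes of immediate data.

```python
from typing import List, Set, Tuple

# PUSH1..PUSH32 occupy opcodes 0x60..0x7f and carry 1..32 immediate bytes.
PUSH1, PUSH32 = 0x60, 0x7F


def opcode_stream(bytecode_hex: str) -> List[int]:
    """Return the opcode bytes of EVM bytecode, skipping PUSH immediates."""
    hex_str = bytecode_hex[2:] if bytecode_hex.startswith("0x") else bytecode_hex
    code = bytes.fromhex(hex_str)
    ops: List[int] = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        ops.append(op)
        pc += 1
        if PUSH1 <= op <= PUSH32:
            pc += op - PUSH1 + 1  # skip the pushed constant
    return ops


def ngrams(ops: List[int], n: int = 3) -> Set[Tuple[int, ...]]:
    """Set of contiguous opcode n-grams (empty if the stream is shorter than n)."""
    return {tuple(ops[i:i + n]) for i in range(len(ops) - n + 1)}


def jaccard_similarity(a_hex: str, b_hex: str, n: int = 3) -> float:
    """Jaccard similarity of the opcode n-gram sets of two bytecodes (illustrative metric)."""
    a = ngrams(opcode_stream(a_hex), n)
    b = ngrams(opcode_stream(b_hex), n)
    if not a and not b:
        return 1.0  # two empty n-gram sets are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # PUSH1 0x80, PUSH1 0x40, MSTORE versus the same code with a different constant.
    print(jaccard_similarity("0x6080604052", "0x6060604052"))
    # PUSH1 0x80, PUSH1 0x40, MSTORE versus PUSH1 0x01, PUSH1 0x02, ADD.
    print(jaccard_similarity("0x6080604052", "0x6001600201", n=2))
```

Because the sketch ignores pushed constants and looks only at local opcode order, it is insensitive to control and data flow, which illustrates the kind of limitation the abstract attributes to instruction-stream approaches on EVM bytecode.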
Nov-18-2025
- Genre:
- Research Report > New Finding (0.93)
- Industry:
- Banking & Finance > Trading (1.00)
- Information Technology > Security & Privacy (1.00)
- Technology: