OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Li, Shangyu, Jiang, Juyong, Zhao, Tiancheng, Shen, Jiasi
arXiv.org Artificial Intelligence
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on the task of generating complete formal specifications for verifying the functional correctness of operating system kernels. The benchmark is built on a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks, each of which is a long-context task of about 20k-30k tokens. It formulates specification generation as a program synthesis problem confined to a domain for specifying states and transitions, and presents this formulation to LLMs through a programming model. The LLMs must understand the programming model and the verification assumptions before they can delineate the correct search space for syntax and semantics and generate formal specifications. Guided by a high-level functional description of the operating system, the LLMs are asked to generate a specification that fully describes all correct states and transitions for a potentially buggy code implementation of the operating system. Experimental results with 12 state-of-the-art LLMs indicate that existing LLMs achieve only limited performance on the specification generation task for operating system verification. Significant disparities in their performance highlight differences in their ability to handle long-context code generation tasks. The code is available at https://github.com/lishangyu-hkust/OSVBench
Dec-9-2025
- Country:
- Asia > China
- Guangdong Province > Guangzhou (0.04)
- Hong Kong (0.04)
- Europe
- North America > United States
- California > Alameda County
- Berkeley (0.04)
- Kansas > Cowley County (0.04)
- Genre:
- Research Report > New Finding (0.68)
- Technology: