A Benchmark for Localizing Code and Non-Code Issues in Software Projects

Zejun Zhang, Jian Wang, Qingyun Yang, Yifan Pan, Yi Tang, Yi Li, Zhenchang Xing, Tian Zhang, Xuandong Li, Guoan Zhang

arXiv.org Artificial Intelligence 

Accurate project localization (e.g., identifying the relevant files and functions) for issue resolution is a critical first step in software maintenance. However, existing benchmarks for issue localization, such as SWE-Bench and LocBench, are limited: they focus predominantly on pull-request issues and code locations, ignoring other evidence and non-code files such as commits, comments, configurations, and documentation. To address this gap, we introduce MULocBench, a comprehensive dataset of 1,100 issues from 46 popular GitHub Python projects. Compared with existing benchmarks, MULocBench offers greater diversity in issue types, root causes, location scopes, and file types, providing a more realistic testbed for evaluation. Using this benchmark, we assess the performance of state-of-the-art localization methods and five LLM-based prompting strategies. Our results reveal significant limitations in current techniques: even at the file level, performance metrics (Acc@5, F1) remain below 40%. This underscores the challenge of generalizing to realistic, multi-faceted issue resolution.

Modern software projects are inherently complex, often consisting of thousands of files spanning code, configurations, tests, and documentation. This complexity means developers routinely encounter a wide spectrum of issues, ranging from runtime failures and unexpected results to enhancement requests and usage questions. A prerequisite for resolving these issues is accurately identifying their locations, such as the relevant files and functions. Existing benchmarks have advanced research on issue localization. SWE-Bench (Jimenez et al.) collects 2,294 issues with pull requests from 12 Python projects, primarily targeting bug fixing. To encourage adoption, it releases SWE-bench Lite, a subset of 300 instances.
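The file-level metrics mentioned above (Acc@5 and F1) can be sketched as follows. This is an illustrative implementation of the standard definitions, not necessarily the exact scoring code used by MULocBench: Acc@k checks whether any gold location appears among the top-k predictions, and F1 is the harmonic mean of precision and recall over the predicted and gold file sets.

```python
def acc_at_k(predicted, gold, k=5):
    """Acc@k: 1 if any gold file appears in the top-k ranked predictions, else 0."""
    return int(any(p in gold for p in predicted[:k]))

def file_f1(predicted, gold):
    """File-level F1 between a predicted file set and the gold file set."""
    pred_set, gold_set = set(predicted), set(gold)
    tp = len(pred_set & gold_set)  # true positives: files in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a localizer ranks two files, one of which is correct.
ranked = ["src/utils.py", "docs/config.md"]
gold = {"docs/config.md"}
print(acc_at_k(ranked, gold, k=5))   # 1 (a gold file is in the top 5)
print(round(file_f1(ranked, gold), 3))  # 0.667 (precision 0.5, recall 1.0)
```

Averaging these per-issue scores over the benchmark yields the aggregate numbers reported in the paper; the below-40% figures indicate that, for most issues, current methods miss the gold files entirely even at the file granularity.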