Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

Nan Jiang, Qi Li, Lin Tan, Tianyi Zhang

arXiv.org Artificial Intelligence 

While much research has focused on hallucinations in multiple modalities, including images and natural language text, less attention has been given to hallucinations in source code, which lead to incorrect and vulnerable code that causes significant financial loss. To pave the way for research on LLMs' hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features for in-depth analysis, such as the per-step log probabilities of LLMs' output, token types, and the execution feedback of LLMs' generated code. In addition, we conduct experiments to predict hallucinations on Collu-Bench using both traditional machine learning techniques and neural networks, which achieve 22.03% to 33.15% accuracy. Our experiments yield insightful findings on code hallucination patterns, reveal the challenge of accurately localizing LLMs' hallucinations, and highlight the need for more sophisticated techniques.

Despite the great potential and impressive success of LLMs (Touvron et al., 2023; Brown et al., 2020; Li et al., 2022a; OpenAI, 2024), a known issue of LLMs is hallucination, a phenomenon where the model generates fluent and plausible-sounding but unfaithful or fabricated content (Ji et al., 2023). Hallucination poses a significant risk when deploying LLMs in real-world applications that require precise information (Puchert et al., 2023). Because of this risk, researchers have developed benchmarks such as TruthfulQA (Lin et al., 2022), FELM (Chen et al., 2023), and HaluEval (Li et al., 2023b) to understand and predict hallucinations of LLMs.
Additionally, researchers are actively exploring methods to mitigate hallucinations (Liu et al., 2024b; Elaraby et al., 2023; Dhuliawala et al., 2023; Yan et al., 2024). Another domain where LLMs have been widely applied is source code.
