Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

Lee, Hokyung, Sharma, Sumanyu, Hu, Bing

arXiv.org Artificial Intelligence 

Recent advancements in Large Language Models (LLMs) have significantly increased their use in various real-world applications, including information retrieval and coding assistance [1]. Notably, the dramatic expansion of context window sizes in models like GPT-4 [2], Claude 3 [3], and Gemini-1.5 [4] has broadened the potential applications of these models. To evaluate the retrieval capabilities of these LLMs within large context windows, a series of benchmarks known as Needle-in-a-Haystack (NIAH) [5] has been developed. The NIAH benchmarks [5] typically involve prompting an LLM to retrieve contextual information based on a clue (the needle) hidden within a large document (the background). These benchmarks have been effective in evaluating LLMs' ability to retrieve information from large text data, such as in text summarization and in the legal and medical domains [6, 7, 8]. NIAH represents important use cases such as finding precedent case law in the legal domain [7] and retrieving information from lengthy electronic health records in the medical domain [8]. Verifying the "faithfulness" of long-text summarization has also been shown to be an important NIAH task in the FABLES dataset [6]. Generating code and programs that follow provided specifications or requirements is a long-standing challenge in computer science called program synthesis [9].
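The NIAH setup described above can be illustrated with a minimal sketch: a needle line is inserted at a chosen fractional depth of a long background text, and the resulting context is paired with a retrieval question. The function name, the insertion scheme, and the question wording below are illustrative assumptions, not the paper's actual benchmark construction.

```python
def build_niah_prompt(haystack_lines, needle, depth_fraction, question):
    """Build a hypothetical Needle-in-a-Haystack prompt.

    haystack_lines: list of background text lines (the haystack).
    needle: the single line of hidden contextual information.
    depth_fraction: where to bury the needle, 0.0 (start) to 1.0 (end).
    question: retrieval question appended after the context.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0.0 and 1.0")
    lines = list(haystack_lines)
    # Insert the needle at the requested depth within the background text.
    position = int(len(lines) * depth_fraction)
    lines.insert(position, needle)
    context = "\n".join(lines)
    return f"{context}\n\n{question}"


# Example: bury a fact halfway into 100 lines of filler text.
haystack = [f"Filler sentence number {i}." for i in range(100)]
prompt = build_niah_prompt(
    haystack,
    needle="The secret passphrase is 'blue-heron'.",
    depth_fraction=0.5,
    question="What is the secret passphrase mentioned in the text above?",
)
```

Varying `depth_fraction` and the haystack length is what lets such benchmarks map retrieval accuracy as a function of needle position and context size.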
