Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks
Hokyung Lee, Sumanyu Sharma, Bing Hu
–arXiv.org Artificial Intelligence
Recent advancements in Large Language Models (LLMs) have significantly increased their use in real-world applications, including information retrieval and coding assistance [1]. Notably, the dramatic expansion of context window sizes in models such as GPT-4 [2], Claude 3 [3], and Gemini-1.5 [4] has broadened the potential applications of these models. To evaluate the retrieval capabilities of LLMs within large context windows, a series of benchmarks known as Needle-in-a-Haystack (NIAH) [5] has been developed. NIAH benchmarks [5] typically prompt an LLM to retrieve a piece of contextual information (the needle) hidden within a large document (the haystack). These benchmarks have been effective in evaluating LLMs' ability to retrieve information from long texts in areas such as text summarization and the legal and medical domains [6, 7, 8]. NIAH reflects important use cases such as finding precedent case law in the legal domain [7] and retrieving information from lengthy electronic health records in the medical domain [8]. Verifying the "faithfulness" of long-text summarization has also been shown to be an important NIAH task in the FABLES dataset [6]. Generating code and programs that follow provided specifications or requirements is a long-standing challenge in computer science known as program synthesis [9].
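To make the NIAH setup concrete, the construction can be sketched as follows. This is a minimal illustrative example, not the paper's actual benchmark harness: the `build_niah_prompt` helper, the placeholder needle sentence, and the filler haystack are all hypothetical, standing in for the general pattern of inserting a needle into a long document at a chosen depth and then asking the model to retrieve it.

```python
def build_niah_prompt(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a relative depth in [0, 1],
    then wrap the result in a retrieval question.

    Hypothetical helper illustrating the NIAH pattern; real benchmarks
    vary needle depth and haystack length to map retrieval accuracy.
    """
    pos = int(len(haystack) * depth)
    # Snap to the nearest preceding sentence boundary so the needle
    # is not spliced into the middle of a word.
    boundary = haystack.rfind(". ", 0, pos)
    cut = boundary + 2 if boundary != -1 else pos
    document = haystack[:cut] + needle + " " + haystack[cut:]
    return (
        "Read the document below and answer the question.\n\n"
        f"{document}\n\n"
        "Question: What is the secret code mentioned in the document?"
    )

# Illustrative usage: a repetitive filler text as the haystack,
# with the needle buried halfway through it.
filler = "This is background text about an unrelated topic. " * 50
prompt = build_niah_prompt(filler, "The secret code is 7421.", depth=0.5)
```

The resulting prompt would then be sent to the LLM under evaluation, and the response checked for the needle's content; sweeping `depth` and the haystack length yields the familiar NIAH retrieval-accuracy heatmaps.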
Jun-21-2024