Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

Lee, Hokyung, Sharma, Sumanyu, Hu, Bing

arXiv.org Artificial Intelligence 

Recent advancements in Large Language Models (LLMs) have significantly increased their use in various real-world applications, including information retrieval and coding assistance [1]. Notably, the dramatic expansion of context window sizes in models like GPT-4 [2], Claude 3 [3], and Gemini-1.5 [4] has broadened the potential applications of these models. To evaluate the retrieval capabilities of these LLMs within large context windows, a series of benchmarks known as Needle-in-a-Haystack (NIAH) [5] has been developed. The NIAH benchmarks [5] typically involve prompting an LLM to retrieve contextual information based on a clue (the needle) hidden within a large document (the background). These benchmarks have been effective in evaluating LLMs' ability to retrieve information from large text data, such as in text summarization and in the legal and medical domains [6, 7, 8]. NIAH represents important use cases such as finding precedent case law in the legal domain [7] and retrieving information from lengthy electronic health records in the medical domain [8]. Verifying the "faithfulness" of long-text summarization has also been shown to be an important NIAH task in the FABLES dataset [6]. Generating code and programs that follow provided specifications or requirements is a long-standing challenge in computer science called program synthesis [9].
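The NIAH setup described above can be illustrated with a minimal sketch: a needle line is inserted at a chosen fractional depth of a long background text, and the resulting context is paired with a retrieval question. The function name, the insertion scheme, and the question wording below are illustrative assumptions, not the paper's actual benchmark construction.

```python
def build_niah_prompt(haystack_lines, needle, depth_fraction, question):
    """Build a hypothetical Needle-in-a-Haystack prompt.

    haystack_lines: list of background text lines (the haystack).
    needle: the single line of hidden contextual information.
    depth_fraction: where to bury the needle, 0.0 (start) to 1.0 (end).
    question: retrieval question appended after the context.
    """
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0.0 and 1.0")
    lines = list(haystack_lines)
    # Insert the needle at the requested depth within the background text.
    position = int(len(lines) * depth_fraction)
    lines.insert(position, needle)
    context = "\n".join(lines)
    return f"{context}\n\n{question}"


# Example: bury a fact halfway into 100 lines of filler text.
haystack = [f"Filler sentence number {i}." for i in range(100)]
prompt = build_niah_prompt(
    haystack,
    needle="The secret passphrase is 'blue-heron'.",
    depth_fraction=0.5,
    question="What is the secret passphrase mentioned in the text above?",
)
```

Varying `depth_fraction` and the haystack length is what lets such benchmarks map retrieval accuracy as a function of needle position and context size.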
