Stack Trace-Based Crash Deduplication with Transformer Adaptation
Mamun, Md Afif Al, Uddin, Gias, Xia, Lan, Zhang, Longyu
Automated crash reporting systems generate large volumes of duplicate reports, overwhelming issue-tracking systems (ITS) and increasing developer workload. Traditional stack trace-based deduplication methods, relying on string similarity, rule-based heuristics, or deep learning (DL) models, often fail to capture the contextual and structural relationships within stack traces. We propose dedupT, a transformer-based approach that models stack traces holistically rather than as isolated frames. Extensive experiments on real-world datasets show that dedupT outperforms existing DL and traditional methods (e.g., sequence alignment and information retrieval techniques) in both duplicate ranking and unique crash detection, significantly reducing manual triage effort. On four public datasets, dedupT improves Mean Reciprocal Rank (MRR) often by over 15% compared to the best DL baseline and up to 9% over traditional methods, while achieving higher Receiver Operating Characteristic Area Under the Curve (ROC-AUC) in detecting unique crash reports. Our work advances the integration of modern natural language processing (NLP) techniques into software engineering, providing an effective solution for stack trace-based crash deduplication. Software issues are generally reported through (1) human-submitted reports and (2) automated crash reports. Human-reported issues typically include textual descriptions detailing the issue and its expected and observed behavior, and may include attachments such as images or videos. In contrast, automated crash reports are generated by crash reporting tools (e.g., Sentry). However, these automated systems often overwhelm ITS platforms by generating numerous duplicate crash reports for the same issue, requiring developers to manually review and triage them, which is a time-consuming process.
For instance, Mozilla Firefox received 2.2 million issues in the first week of 2016, the majority being duplicates [1], while 72% of crash reports in the IntelliJ Platform were found to be duplicates [2]. In such scenarios, grouping similar crashes together is essential, a process known as crash deduplication. Unlike human-written reports with detailed descriptions, automated crash reports primarily contain technical data like stack traces and crash dumps.
Figure 1: Example of a Java stack trace.
Figure 1: Example of a C++ stack trace.
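The Mean Reciprocal Rank metric reported above can be illustrated with a minimal sketch (not the authors' code): for each query crash, take the 1-based rank at which its true duplicate was retrieved, and average the reciprocals of those ranks.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR: mean of 1/rank of the first correct duplicate per query.

    `first_relevant_ranks` holds 1-based rank positions; a query whose
    duplicate was never retrieved is encoded as 0 and contributes 0.
    """
    if not first_relevant_ranks:
        return 0.0
    return sum(1.0 / r if r else 0.0 for r in first_relevant_ranks) / len(first_relevant_ranks)

# Hypothetical example: true duplicates retrieved at ranks 1, 3, and 2.
print(mean_reciprocal_rank([1, 3, 2]))  # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```

A 15% MRR improvement thus directly means true duplicates surface noticeably closer to the top of the ranked list a triager inspects.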
Hear Your Code Fail, Voice-Assisted Debugging for Python
Amiri, Sayed Mahbub Hasan, Islam, Md. Mainul, Hossen, Mohammad Shakhawat, Amiri, Sayed Majhab Hasan, Mamun, Mohammad Shawkat Ali, Kabir, Sk. Humaun, Akter, Naznin
This staggering performance drain translates to roughly $61 billion in yearly financial losses throughout the worldwide software market, as quantified by the Standish Group's 2023 analysis of development workflows. The core inefficiency stems from traditional debugging's visual-only paradigm, where developers must manually parse dense, technical stack traces while mentally reconstructing error context, a process requiring intense cognitive focus that fragments attention between code logic and exception diagnostics. Neuroergonomic research from MIT's Human-Computer Interaction Lab reveals this context-switching triggers measurable cognitive overload, increasing prefrontal cortex activation by 60% compared to focused coding tasks, ultimately leading to mental fatigue that compounds debugging errors. The accessibility limitations of conventional debugging tools create additional barriers for approximately 12.5% of professional developers with visual impairments (World Health Organization, 2024), who struggle with screen readers that poorly interpret technical tracebacks. As Sarah Parker, a blind Python developer at Microsoft, testified during the 2023 Accessible Tech Symposium: "NVDA reads exception blocks as disconnected fragments; I spend more time reassembling error narratives than solving actual problems."
A Tool for Generating Exceptional Behavior Tests With Large Language Models
Zhong, Linghan, Yuan, Samuel, Zhang, Jiyang, Liu, Yu, Nie, Pengyu, Li, Junyi Jessy, Gligoric, Milos
Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing "happy paths" (e.g., paths without unwanted events) over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their projects (available at https://youtu.be/Jro8kMgplZk).
Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces
Jambigi, Neetha, Bogacz, Bartosz, Mueller, Moritz, Bach, Thomas, Felderer, Michael
Abrupt and unexpected terminations of software are termed software crashes. They can be challenging to analyze. Finding the root cause requires extensive manual effort and expertise to connect information sources like stack traces, source code, and logs. Typical approaches to fault localization require either test failures or source code. Crashes occurring in production environments, such as that of SAP HANA, provide solely crash logs and stack traces. We present a novel approach to localize faults based only on the stack trace information and no additional runtime information, by fine-tuning large language models (LLMs). We address complex cases where the root cause of a crash differs from the technical cause and is not located in the innermost frame of the stack trace. As the number of historic crashes is insufficient to fine-tune LLMs, we augment our dataset by leveraging code mutators to inject synthetic crashes into the code base. By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the HANA code base, we can correctly predict the root cause location of a crash with an accuracy of 66.9%, while baselines only achieve 12.6% and 10.6%. We substantiate the generalizability of our approach by evaluating on two additional open-source databases, SQLite and DuckDB, achieving accuracies of 63% and 74%, respectively. Across all our experiments, fine-tuning consistently outperformed prompting non-finetuned LLMs for localizing faults in our datasets.
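The mutation-based data augmentation described above can be pictured with a toy sketch. The paper's actual mutators are not shown here; this hypothetical example flips a None-guard so that executing the mutant raises an exception whose stack trace points at the injected fault.

```python
import re

def mutate_guard(source: str) -> str:
    """Toy mutator: flip a None-check so guarded code runs on None,
    producing a synthetic crash at the mutated location.
    (Illustrative only; not the paper's mutators.)"""
    return re.sub(r"is not None", "is None", source, count=1)

original = "def head(xs):\n    if xs is not None:\n        return xs[0]\n"
mutated = mutate_guard(original)

# Executing the mutant with a None argument now raises TypeError,
# yielding a stack trace rooted at the injected fault.
namespace = {}
exec(mutated, namespace)
try:
    namespace["head"](None)
except TypeError as e:
    print(type(e).__name__)  # → TypeError
```

Each such (mutated location, resulting stack trace) pair becomes a labeled training example for fine-tuning.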
The Impact of Input Order Bias on Large Language Models for Software Fault Localization
Rafi, Md Nakhla, Kim, Dong Jae, Chen, Tse-Hsun, Wang, Shaowei
Large Language Models (LLMs) show great promise in software engineering tasks like Fault Localization (FL) and Automatic Program Repair (APR). This study examines how input order and context size affect LLM performance in FL, a key step for many downstream software engineering tasks. We test different orders for methods using Kendall Tau distances, including "perfect" (where ground truths come first) and "worst" (where ground truths come last). Our results show a strong bias in order, with Top-1 accuracy falling from 57% to 20% when we reverse the code order. Breaking down inputs into smaller contexts helps reduce this bias, narrowing the performance gap between perfect and worst orders from 22% to just 1%. We also look at ordering methods based on traditional FL techniques and metrics. Ordering using DepGraph's ranking achieves 48% Top-1 accuracy, better than more straightforward ordering approaches like CallGraph. These findings underscore the importance of how we structure inputs, manage contexts, and choose ordering methods to improve LLM performance in FL and other software engineering tasks.
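The Kendall Tau distance used above to characterize orderings counts discordant pairs: pairs of methods ranked one way in one ordering and the other way in the second. A minimal sketch (not the study's code):

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count discordant pairs between two orderings of the same items.
    0 means identical orderings; n*(n-1)/2 means exact reversal."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    discordant = 0
    for x, y in combinations(order_a, 2):
        # x precedes y in order_a; discordant if y precedes x in order_b.
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant

# Hypothetical method list: "perfect" puts ground truths first,
# "worst" is the exact reverse.
perfect = ["m1", "m2", "m3", "m4"]
worst = list(reversed(perfect))
print(kendall_tau_distance(perfect, worst))  # → 6, the maximum for 4 items
```

The "perfect" and "worst" orders in the study are the two extremes of this distance, which is what makes the 57% vs. 20% Top-1 gap an order-bias measurement.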
exLong: Generating Exceptional Behavior Tests with Large Language Models
Zhang, Jiyang, Liu, Yu, Nie, Pengyu, Li, Junyi Jessy, Gligoric, Milos
Many popular programming languages, including C#, Java, and Python, support exceptions. Exceptions are thrown during program execution if an unwanted event happens, e.g., a method is invoked with an illegal argument value. Software developers write exceptional behavior tests (EBTs) to check that their code detects unwanted events and throws appropriate exceptions. Prior research studies have shown the importance of EBTs, but those studies also highlighted that developers put most of their effort into "happy paths", e.g., paths without unwanted events. To help developers fill the gap, we present the first framework, dubbed exLong, that automatically generates EBTs. exLong is a large language model instruction fine-tuned from CodeLlama and embeds reasoning about traces that lead to throw statements, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. We compare exLong with the state-of-the-art models for test generation (CAT-LM) and one of the strongest foundation models (GPT-4o), as well as with analysis-based tools for test generation (Randoop and EvoSuite). Our results show that exLong outperforms existing models and tools. Furthermore, we contributed several pull requests to open-source projects and 23 EBTs generated by exLong were already accepted.
Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios
Shibaev, Egor, Sushentsev, Denis, Golubev, Yaroslav, Khvorov, Aleksandr
In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., series of function calls that led to the error. Since there can be tens or even hundreds of thousands of them describing the same issue from different users, automatic deduplication into categories is necessary to allow for processing. Recent works have proposed powerful deep learning-based approaches for this, but they are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To overcome this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts: (1) an embedding model with byte-pair encoding and approximate nearest neighbor search to quickly find the most relevant stack traces to the incoming one, and (2) a reranker that re-ranks the most fitting stack traces, taking into account the repeated frames between them. To complement the existing datasets collected from open-source projects, we share with the community SlowOps, a dataset of stack traces from IntelliJ-based products developed by JetBrains, which has an order of magnitude more stack traces per category. Finally, we carry out an evaluation that strives to be realistic: measuring not only the accuracy of categorization, but also the operation time and the ability to create new categories. The evaluation shows that our model strikes a good balance: it outperforms other models on both open-source datasets and SlowOps, while also being faster than most. We release all of our code and data, and hope that our work can pave the way for further practice-oriented research in the area.
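The two-stage retrieve-then-rerank shape described above can be sketched in a few lines. This is a deliberately simplified stand-in: a bag-of-frames vector replaces the paper's BPE-based neural embedding, exact cosine search replaces approximate nearest neighbor search, and the reranker is reduced to counting shared frames.

```python
from collections import Counter
import math

def embed(trace):
    """Toy bag-of-frames embedding standing in for a learned encoder."""
    return Counter(trace)

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup_rank(incoming, known_traces, top_k=2):
    # Stage 1: retrieve the nearest candidates by embedding similarity
    # (an ANN index would replace this linear scan at scale).
    q = embed(incoming)
    candidates = sorted(known_traces, key=lambda t: -cosine(q, embed(t)))[:top_k]
    # Stage 2: rerank the candidates by the frames they share with the query.
    return sorted(candidates, key=lambda t: -len(set(t) & set(incoming)))

incoming = ["main", "parse", "read_chunk"]
known = [["main", "parse", "read_chunk"], ["main", "render"], ["init", "load"]]
print(dedup_rank(incoming, known)[0])  # → ['main', 'parse', 'read_chunk']
```

The design point the sketch preserves is that the cheap first stage bounds how many traces the more careful second stage ever has to examine.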
Labeling questions inside issue trackers
One of the issues faced by the maintainers of popular open-source software is the triage of newly reported issues. Many of the issues submitted to issue trackers are questions: people ask about their problem on the issue tracker instead of using a dedicated Q&A website such as Stack Overflow. This may seem insignificant, but for many big projects with thousands of users, it leads to spamming of the issue tracker. Reading and labeling these unrelated issues manually is a seriously time-consuming task, and these unrelated questions add to the burden. In fact, maintainers most often demand that questions not be submitted to the issue tracker. To address this problem, we first leveraged dozens of patterns to clean the text of issues, removing noise such as logs, stack traces, environment variables, and error messages. Second, we implemented a classification-based approach to automatically label unrelated questions. Empirical evaluations on a dataset of more than 102,000 records show that our approach can label questions with an accuracy of over 81%.
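The clean-then-classify pipeline above can be sketched minimally. The two noise patterns and the keyword cue below are hypothetical stand-ins: the paper uses dozens of cleaning patterns and a trained classifier, not a regex heuristic.

```python
import re

# Stand-ins for the dozens of noise-stripping patterns the authors used
# (stack traces, logs, environment dumps); illustrative only.
NOISE_PATTERNS = [
    re.compile(r"(?m)^\s*at [\w.$]+\(.*\)\s*$"),  # Java-style stack frames
    re.compile(r"(?m)^[A-Z_]+=\S+\s*$"),           # environment variables
]

QUESTION_CUES = re.compile(r"\b(how (do|can|to)|why|what is|\?)", re.I)

def clean(text):
    """Strip noise lines before classification, as the pipeline does."""
    for pat in NOISE_PATTERNS:
        text = pat.sub("", text)
    return text.strip()

def is_question(text):
    """Toy keyword rule standing in for the trained classifier."""
    return bool(QUESTION_CUES.search(clean(text)))

issue = """How do I configure the logger?
    at com.example.Main.run(Main.java:42)
JAVA_HOME=/usr/lib/jvm"""
print(is_question(issue))  # → True
```

Cleaning first matters because logs and tracebacks are full of tokens ("error", "exception") that would otherwise mislead any downstream classifier.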
ChatDBG: An AI-Powered Debugging Assistant
Levin, Kyla, van Kempen, Nicolas, Berger, Emery D., Freund, Stephen N.
This paper presents ChatDBG, the first AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to take the wheel and drive debugging by issuing commands to navigate through stacks and inspect program state; it then reports its findings and yields back control to the programmer. Our ChatDBG prototype integrates with standard debuggers including LLDB, GDB, and WinDBG for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded nearly 30,000 times.
Combining Machine Learning and Lifetime-Based Resource Management for Memory Allocation and Beyond
Memory management is a decades-old research area that is fundamental to the performance of all applications. On modern architectures, memory managers determine a workload's ability to use 2MB (and 1GB) huge pages instead of traditional 4KB pages. The use of huge pages is crucial for performance on modern servers since they substantially reduce the cost of address translation by producing a wider reach in Translation Lookaside Buffers (TLBs), reducing misses on the CPU's critical path. Current huge page-aware memory managers trade off huge page usage against memory utilization, breaking up huge pages when they become inefficient. Figure 1 visualizes the source of this trade-off: when a C program allocates memory, it calls into a memory allocator library (e.g., TCMalloc), which places the object at a particular address in memory until the program deletes it. The object may not move.
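The TLB-reach benefit of huge pages mentioned above reduces to simple arithmetic: reach is entry count times page size. The 1,536-entry TLB below is a hypothetical figure for illustration; real counts vary by microarchitecture.

```python
# Back-of-the-envelope TLB reach: entries * page size.
# 1,536 data-TLB entries is an assumed example value, not a measured one.
ENTRIES = 1536
KB, MB, GB = 1024, 1024**2, 1024**3

reach_4k = ENTRIES * 4 * KB   # traditional 4KB pages
reach_2m = ENTRIES * 2 * MB   # 2MB huge pages

print(f"4KB pages: {reach_4k // MB} MB of address space covered")   # 6 MB
print(f"2MB pages: {reach_2m // GB} GB of address space covered")   # 3 GB
# The same TLB covers 512x more memory with 2MB pages,
# which is why huge pages cut translation misses so sharply.
```

This 512x factor is exactly the 2MB/4KB page-size ratio, independent of the assumed entry count.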