Goto

Collaborating Authors

 call graph


FlakyGuard: Automatically Fixing Flaky Tests at Industry Scale

arXiv.org Artificial Intelligence

Flaky tests that non-deterministically pass or fail waste developer time and slow release cycles. While large language models (LLMs) show promise for automatically repairing flaky tests, existing approaches like FlakyDoctor fail in industrial settings due to the context problem: providing either too little context (missing critical production code) or too much context (overwhelming the LLM with irrelevant information). We present FlakyGuard, which addresses this problem by treating code as a graph structure and using selective graph exploration to find only the most relevant context. Evaluation on real-world flaky tests from industrial repositories shows that FlakyGuard repairs 47.6 % of reproducible flaky tests with 51.8 % of the fixes accepted by developers. Besides it outperforms state-of-the-art approaches by at least 22 % in repair success rate. Developer surveys confirm that 100 % find FlakyGuard's root cause explanations useful.


Enhancing Neural Code Representation with Additional Context

arXiv.org Artificial Intelligence

Automated program comprehension underpins many software engineering tasks, from code summarisation to clone detection. Recent deep learning models achieve strong results but typically rely on source code alone, overlooking contextual information such as version history or structural relationships. This limits their ability to capture how code evolves and operates. We conduct an empirical study on how enriching code representations with such contextual signals affects neural model performance on key comprehension tasks. Two downstream tasks, code clone detection and code summarisation, are evaluated using SeSaMe (1,679 Java methods) and CodeSearchNet (63,259 methods). Five representative models (CodeBERT, GraphCodeBERT, CodeT5, PLBART, ASTNN) are fine-tuned under code-only and context-augmented settings. Results show that context generally improves performance: version history consistently boosts clone detection (e.g., CodeT5 +15.92% F1) and summarisation (e.g., GraphCodeBERT +5.56% METEOR), while call-graph effects vary by model and task. Combining multiple contexts yields further gains (up to +21.48% macro-F1). Human evaluation on 100 Java snippets confirms that context-augmented summaries are significantly preferred for Accuracy and Content Adequacy (p <= 0.026; |delta| up to 0.55). These findings highlight the potential of contextual signals to enhance code comprehension and open new directions for optimising contextual encoding in neural SE models.


Learning Joint Embeddings of Function and Process Call Graphs for Malware Detection

arXiv.org Artificial Intelligence

Software systems can be represented as graphs, capturing dependencies among functions and processes. An interesting aspect of software systems is that they can be represented as different types of graphs, depending on the extraction goals and priorities. For example, function calls within the software can be captured to create function call graphs, which highlight the relationships between functions and their dependencies. Alternatively, the processes spawned by the software can be modeled to generate process interaction graphs, which focus on runtime behavior and inter-process communication. While these graph representations are related, each captures a distinct perspective of the system, providing complementary insights into its structure and operation. While previous studies have leveraged graph neural networks (GNNs) to analyze software behaviors, most of this work has focused on a single type of graph representation. The joint modeling of both function call graphs and process interaction graphs remains largely underexplored, leaving opportunities for deeper, multi-perspective analysis of software systems. This paper presents a pipeline for constructing and training Function Call Graphs (FCGs) and Process Call Graphs (PCGs) and learning joint embeddings. We demonstrate that joint embeddings outperform a single-graph model. In this paper, we propose GeminiNet, a unified neural network approach that learns joint embeddings from both FCGs and PCGs. We construct a new dataset of 635 Windows executables (318 malicious and 317 benign), extracting FCGs via Ghidra and PCGs via Any.Run sandbox. GeminiNet employs dual graph convolutional branches with an adaptive gating mechanism that balances contributions from static and dynamic views.


VOLTRON: Detecting Unknown Malware Using Graph-Based Zero-Shot Learning

arXiv.org Artificial Intelligence

The persistent threat of Android malware presents a serious challenge to the security of millions of users globally. While many machine learning-based methods have been developed to detect these threats, their reliance on large labeled datasets limits their effectiveness against emerging, previously unseen malware families, for which labeled data is scarce or nonexistent. To address this challenge, we introduce a novel zero-shot learning framework that combines Variational Graph Auto-Encoders (VGAE) with Siamese Neural Networks (SNN) to identify malware without needing prior examples of specific malware families. Our approach leverages graph-based representations of Android applications, enabling the model to detect subtle structural differences between benign and malicious software, even in the absence of labeled data for new threats. Experimental results show that our method outperforms the state-of-the-art MaMaDroid, especially in zero-day malware detection. Our model achieves 96.24% accuracy and 95.20% recall for unknown malware families, highlighting its robustness against evolving Android threats.


Call Me Maybe: Enhancing JavaScript Call Graph Construction using Graph Neural Networks

arXiv.org Artificial Intelligence

Static analysis plays a key role in finding bugs, including security issues. A critical step in static analysis is building accurate call graphs that model function calls in a program. However, due to hard-to-analyze language features, existing call graph construction algorithms for JavaScript are neither sound nor complete. Prior work shows that even advanced solutions produce false edges and miss valid ones. In this work, we assist these tools by identifying missed call edges. Our main idea is to frame the problem as link prediction on full program graphs, using a rich representation with multiple edge types. Our approach, GRAPHIA, leverages recent advances in graph neural networks to model non-local relationships between code elements. Concretely, we propose representing JavaScript programs using a combination of syntactic- and semantic-based edges. GRAPHIA can learn from imperfect labels, including static call edges from existing tools and dynamic edges from tests, either from the same or different projects. Because call graphs are sparse, standard machine learning metrics like ROC are not suitable. Instead, we evaluate GRAPHIA by ranking function definitions for each unresolved call site. We conduct a large-scale evaluation on 50 popular JavaScript libraries with 163K call edges (150K static and 13K dynamic). GRAPHIA builds program graphs with 6.6M structural and 386K semantic edges. It ranks the correct target as the top candidate in over 42% of unresolved cases and within the top 5 in 72% of cases, reducing the manual effort needed for analysis. Our results show that learning-based methods can improve the recall of JavaScript call graph construction. To our knowledge, this is the first work to apply GNN-based link prediction to full multi-file program graphs for interprocedural analysis.


CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching

arXiv.org Artificial Intelligence

Large language models (LLMs) have significantly advanced autonomous software engineering, leading to a growing number of software engineering agents that assist developers in automatic program repair. Issue localization forms the basis for accurate patch generation. However, because of limitations caused by the context window length of LLMs, existing issue localization methods face challenges in balancing concise yet effective contexts and adequately comprehensive search spaces. In this paper, we introduce CoSIL, an LLM driven, simple yet powerful function level issue localization method without training or indexing. CoSIL reduces the search space through module call graphs, iteratively searches the function call graph to obtain relevant contexts, and uses context pruning to control the search direction and manage contexts effectively. Importantly, the call graph is dynamically constructed by the LLM during search, eliminating the need for pre-parsing. Experiment results demonstrate that CoSIL achieves a Top-1 localization success rate of 43 percent and 44.6 percent on SWE bench Lite and SWE bench Verified, respectively, using Qwen2.5 Coder 32B, outperforming existing methods by 8.6 to 98.2 percent. When CoSIL is applied to guide the patch generation stage, the resolved rate further improves by 9.3 to 31.5 percent.


DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

arXiv.org Artificial Intelligence

The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code.


Large Language Models as Realistic Microservice Trace Generators

arXiv.org Artificial Intelligence

Workload traces are essential to understand complex computer systems' behavior and manage processing and memory resources. Since real-world traces are hard to obtain, synthetic trace generation is a promising alternative. This paper proposes a first-of-a-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we show how to fine-tune LLMs to generate recursively, making call graph generation a sequence of easier steps. To further enforce learning constraints in traces and generate uncommon situations, we argue for applying additional instruction tuning steps to align our model with the desired trace features. Our evaluation results show that we can generate diverse realistic traces under various conditions and outperform existing methods in accuracy and validity. We demonstrate that our synthetically generated traces can effectively replace real data to optimize important microservice management tasks. Additionally, our model adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.


The Potential of LLMs in Automating Software Testing: From Generation to Reporting

arXiv.org Artificial Intelligence

Having a high quality software is essential in software engineering, which requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system's high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.


Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

arXiv.org Artificial Intelligence

Generating accurate code review comments remains a significant challenge due to the inherently diverse and non-unique nature of the task output. Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. However, large-scale pretraining is not always feasible due to its environmental impact and project-specific generalizability issues. In this work, first we fine-tune open-source Large language models (LLM) in parameter-efficient, quantized low-rank (QLoRA) fashion on consumer-grade hardware to improve review comment generation. Recent studies demonstrate the efficacy of augmenting semantic metadata information into prompts to boost performance in other code-related tasks. To explore this in code review activities, we also prompt proprietary, closed-source LLMs augmenting the input code patch with function call graphs and code summaries. Both of our strategies improve the review comment generation performance, with function call graph augmented few-shot prompting on the GPT-3.5 model surpassing the pretrained baseline by around 90% BLEU-4 score on the CodeReviewer dataset. Moreover, few-shot prompted Gemini-1.0 Pro, QLoRA fine-tuned Code Llama and Llama 3.1 models achieve competitive results (ranging from 25% to 83% performance improvement) on this task. An additional human evaluation study further validates our experimental findings, reflecting real-world developers' perceptions of LLM-generated code review comments based on relevant qualitative metrics.