
Collaborating Authors

 Shetty, Manish


Challenges and Paths Towards AI for Software Engineering

arXiv.org Artificial Intelligence

AI for software engineering has made remarkable progress recently, becoming a notable success within generative AI. Despite this, there are still many challenges that need to be addressed before automated software engineering reaches its full potential. It should be possible to reach high levels of automation where humans can focus on the critical decisions of what to build and how to balance difficult tradeoffs while most routine development effort is automated away. Reaching this level of automation will require substantial research and engineering efforts across academia and industry. In this paper, we aim to discuss progress towards this in a threefold manner. First, we provide a structured taxonomy of concrete tasks in AI for software engineering, emphasizing the many other tasks in software engineering beyond code generation and completion. Second, we outline several key bottlenecks that limit current approaches. Finally, we provide an opinionated list of promising research directions toward making progress on these bottlenecks, hoping to inspire future research in this rapidly maturing field.


AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds

arXiv.org Artificial Intelligence

AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
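
To make the orchestration concrete, here is a minimal, self-contained Python sketch of how such a benchmark can tie deployment, fault injection, workload generation, telemetry, and agent evaluation together. It is not the AIOPSLAB API; every class and method name below (Orchestrator, Problem, toy_agent, and so on) is a hypothetical placeholder.

```python
# Minimal sketch (not the AIOPSLAB API): illustrates how a holistic AIOps benchmark
# can orchestrate deployment, fault injection, workload, telemetry, and agent evaluation.
# All class and method names below are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Telemetry:
    logs: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)

@dataclass
class Problem:
    """One benchmark task: a fault to inject and the expected diagnosis."""
    fault: str
    expected_root_cause: str

class Orchestrator:
    """Ties the environment pieces together and exposes them to an agent."""

    def __init__(self, problem: Problem):
        self.problem = problem
        self.telemetry = Telemetry()

    def deploy_app(self) -> None:
        self.telemetry.logs.append("deployed: social-network microservices")

    def inject_fault(self) -> None:
        self.telemetry.logs.append(f"fault injected: {self.problem.fault}")
        self.telemetry.metrics["error_rate"] = 0.42

    def run_workload(self) -> None:
        self.telemetry.metrics["qps"] = 1000.0

    def evaluate(self, agent: Callable[[Telemetry], str]) -> bool:
        """Give the agent read access to telemetry and grade its diagnosis."""
        diagnosis = agent(self.telemetry)
        return diagnosis == self.problem.expected_root_cause

def toy_agent(telemetry: Telemetry) -> str:
    # A real agent would be an LLM loop issuing observe/act calls; this
    # stand-in just pattern-matches the injected fault from the logs.
    for line in telemetry.logs:
        if line.startswith("fault injected:"):
            return line.split(": ", 1)[1]
    return "unknown"

if __name__ == "__main__":
    orch = Orchestrator(Problem(fault="mongodb-pod-crash",
                                expected_root_cause="mongodb-pod-crash"))
    orch.deploy_app()
    orch.inject_fault()
    orch.run_workload()
    print("agent correct:", orch.evaluate(toy_agent))
```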


Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis

arXiv.org Artificial Intelligence

C-to-Rust translation is further motivated by the fact that both languages target similar applications (low-level, performance-critical libraries) and are supported by Clang-based compiler toolchains. Though desirable, the translation is challenging: C and (safe) Rust employ different type systems (strongly typed variables and no raw pointers in Rust) and different variable access rules (arbitrary accesses in C versus strict borrowing rules in Rust), among other differences. Manual migration of even moderately sized codebases requires multiple person-weeks of effort, motivating the need for automatic translation techniques. There are two main approaches to code translation: rule-based/symbolic and LLM (Large Language Model)-based. Rule-based approaches typically operate on a terse intermediate representation (to achieve full coverage with a limited rule set) and thus tend to produce uninterpretable target code. Symbolic program synthesis approaches (e.g., [2, 34]), on the other hand, often do not scale to multi-function codebases. LLMs shine in both respects: they produce natural, interpretable code and scale better.
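
The dual code-test idea lends itself to a simple driver loop: translate a C unit with an LLM, compile and run tests derived from dynamic analysis of the C code, and feed failures back to the model. The sketch below illustrates that loop under stated assumptions; llm_translate and llm_repair are hypothetical placeholders for model calls, not Syzygy's actual interfaces, and only the `cargo test` invocation is a real command.

```python
# Sketch of a translate-then-validate loop in the spirit of dual code-test
# translation. The LLM calls are placeholders; plug in a real model to use it.
import subprocess
from pathlib import Path

def llm_translate(c_source: str) -> str:
    """Placeholder for an LLM call that emits candidate safe Rust."""
    raise NotImplementedError("plug in a model call here")

def llm_repair(rust_source: str, feedback: str) -> str:
    """Placeholder for an LLM call that fixes the candidate given feedback."""
    raise NotImplementedError("plug in a model call here")

def compiles_and_passes_tests(crate_dir: Path) -> tuple[bool, str]:
    """Run `cargo test`; dynamic analysis of the C code would supply the
    input/output pairs those tests assert on."""
    proc = subprocess.run(["cargo", "test"], cwd=crate_dir,
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def translate_function(c_source: str, crate_dir: Path,
                       max_attempts: int = 5) -> str | None:
    candidate = llm_translate(c_source)
    for _ in range(max_attempts):
        (crate_dir / "src" / "lib.rs").write_text(candidate)
        ok, feedback = compiles_and_passes_tests(crate_dir)
        if ok:
            return candidate                      # tests pass: equivalence evidence
        candidate = llm_repair(candidate, feedback)  # feed errors back to the model
    return None
```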


Building AI Agents for Autonomous Clouds: Challenges and Design Principles

arXiv.org Artificial Intelligence

The rapid growth in the use of Large Language Models (LLMs) and AI agents as part of software development and deployment is revolutionizing the information technology landscape. While code generation receives significant attention, a higher-impact application lies in using AI agents for the operational resilience of cloud services, which currently requires significant human effort and domain knowledge. There is growing interest in AI for IT Operations (AIOps), which aims to automate complex operational tasks, like fault localization and root cause analysis, thereby reducing human intervention and customer impact. However, achieving the vision of autonomous and self-healing clouds through AIOps is hampered by the lack of standardized frameworks for building, evaluating, and improving AIOps agents. This vision paper lays the groundwork for such a framework by first framing the requirements and then discussing design decisions that satisfy them. We also propose AIOpsLab, a prototype implementation leveraging an agent-cloud interface that orchestrates an application, injects real-time faults using chaos engineering, and interfaces with an agent to localize and resolve the faults. We report promising results and lay the groundwork to build a modular and robust framework for building, evaluating, and improving agents for autonomous clouds.
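
One way to picture the agent-cloud interface is as a narrow action vocabulary mediating between the agent and the environment, rather than raw cluster access. The following sketch is purely illustrative: the names (AgentCloudInterface, Action, rule_based_agent) are hypothetical, the environment is simulated in memory, and a rule-based stand-in takes the place of an LLM agent.

```python
# Hypothetical sketch of the agent side of an agent-cloud interface (ACI):
# the agent is restricted to a small action vocabulary (observe telemetry,
# run a diagnostic command, submit a mitigation). Not the paper's interface.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "get_logs" | "run_cmd" | "submit"
    payload: str = ""

class AgentCloudInterface:
    """Mediates between an agent and the (here, simulated) cloud environment."""

    def __init__(self) -> None:
        self._logs = ["pod checkout-7f9 restarting: OOMKilled"]

    def step(self, action: Action) -> str:
        if action.kind == "get_logs":
            return "\n".join(self._logs)
        if action.kind == "run_cmd":
            return f"(simulated output of: {action.payload})"
        if action.kind == "submit":
            return "accepted" if "OOM" in action.payload else "rejected"
        return "unknown action"

def rule_based_agent(aci: AgentCloudInterface) -> str:
    # A real agent would be an LLM choosing the next Action from prior
    # observations; this stand-in hard-codes one plausible trajectory.
    logs = aci.step(Action("get_logs"))
    if "OOMKilled" in logs:
        return aci.step(Action("submit", "OOM on checkout pod; raise memory limit"))
    return aci.step(Action("submit", "needs human escalation"))

print(rule_based_agent(AgentCloudInterface()))
```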


DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines

arXiv.org Artificial Intelligence

Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic "prompt engineering". We introduce LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. We integrate our constructs into the recent DSPy programming model for LMs, and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. We also propose strategies to use assertions at inference time for automatic self-refinement with LMs. We report on four diverse case studies for text generation and find that LM Assertions improve not only compliance with imposed rules but also downstream task performance, passing constraints up to 164% more often and producing up to 37% more high-quality responses. Our reference implementation of LM Assertions is integrated into DSPy at https://github.com/stanfordnlp/dspy
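
As a flavor of the construct, the sketch below adds a single constraint to a small DSPy module, following the paper's reference implementation (dspy.Suggest for soft constraints, dspy.Assert for hard ones). Exact activation and LM-configuration calls vary across DSPy versions, so treat the configuration lines as assumptions rather than the canonical setup.

```python
# Minimal sketch of an LM Assertion in a DSPy pipeline. The dspy.Suggest call
# follows the paper's reference implementation; LM configuration and
# activate_assertions() may differ depending on your DSPy version.
import dspy

class TweetSignature(dspy.Signature):
    """Write a tweet that answers the question."""
    question = dspy.InputField()
    tweet = dspy.OutputField()

class TweetWriter(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(TweetSignature)

    def forward(self, question):
        pred = self.generate(question=question)
        # A computational constraint on the LM output. On failure, DSPy
        # backtracks and retries the call with the error message in context.
        dspy.Suggest(len(pred.tweet) <= 280,
                     "The tweet must be at most 280 characters.")
        return pred

# Assumed configuration: swap in whichever LM client your DSPy version provides.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))
writer = TweetWriter().activate_assertions()
print(writer(question="What are LM Assertions?").tweet)
```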


CodeScholar: Growing Idiomatic Code Examples

arXiv.org Artificial Intelligence

Programmers often search for usage examples for API methods. A tool that could generate realistic, idiomatic, and contextual usage examples for one or more APIs would be immensely beneficial to developers. Such a tool would relieve the need for a deep understanding of the API landscape, augment existing documentation, and help discover interactions among APIs. We present CodeScholar, a tool that generates idiomatic code examples demonstrating the common usage of API methods. It includes a novel neural-guided search technique over graphs that grows the query APIs into idiomatic code examples. Our user study demonstrates that in 70% of cases, developers prefer CodeScholar-generated examples over state-of-the-art large language models (LLMs) such as GPT-3.5. We quantitatively evaluate 60 single-API and 25 multi-API queries from 6 popular Python libraries and show that, across the board, CodeScholar generates more realistic, diverse, and concise examples. In addition, we show that CodeScholar helps not only developers but also LLM-powered programming assistants generate correct code in a program synthesis setting.
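
At a high level, the search grows a query API into progressively larger, frequently occurring usage patterns. The toy sketch below illustrates that growth as a beam search over a miniature corpus; the frequency-based score is only a stand-in for CodeScholar's neural guidance model, and the corpus and API names are invented for illustration.

```python
# Illustrative sketch (not CodeScholar's implementation): grow a query API into
# larger, frequently occurring API sets via beam search over a usage corpus.
# A simple support count stands in for the paper's neural guidance model.

# Toy "corpus": each program is reduced to the set of API calls it uses.
corpus = [
    {"json.load", "open", "dict.get"},
    {"json.load", "open", "json.dump"},
    {"open", "dict.get"},
]

def score(candidate: frozenset[str]) -> int:
    """Stand-in for neural guidance: how many programs contain the candidate."""
    return sum(candidate <= prog for prog in corpus)

def grow(query: str, steps: int = 2, beam: int = 3) -> list[frozenset[str]]:
    frontier = [frozenset({query})]
    vocab = set().union(*corpus)
    for _ in range(steps):
        expanded = {cand | {api} for cand in frontier for api in vocab - cand}
        # Keep only the highest-scoring (most idiomatic) expansions.
        frontier = sorted(expanded, key=score, reverse=True)[:beam]
    return frontier

for idiom in grow("json.load"):
    print(sorted(idiom), "support:", score(idiom))
```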


Large-scale Crash Localization using Multi-Task Learning

arXiv.org Artificial Intelligence

Crash localization, an important step in debugging crashes, is challenging when dealing with an extremely large number of diverse applications, platforms, and underlying root causes. Large-scale error reporting systems, e.g., Windows Error Reporting (WER), commonly rely on manually developed rules and heuristics to localize the blamed frames causing the crashes. As new applications and features are routinely introduced and existing applications are run in new environments, developing new rules and maintaining existing ones become extremely challenging. We propose a data-driven solution to address the problem. We start with the first large-scale empirical study of 362K crashes and their blamed methods reported to WER by tens of thousands of applications running in the field. The analysis provides valuable insights into where and how crashes happen and which methods to blame for them. These insights enable us to develop DeepAnalyze, a novel multi-task sequence labeling approach for identifying blamed frames in stack traces. We evaluate our model with over a million real-world crashes from four popular Microsoft applications and show that DeepAnalyze, trained with crashes from one set of applications, not only accurately localizes crashes of the same applications, but also bootstraps crash localization for other applications with zero to very little additional training data.
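
The core framing is to treat a stack trace as a sequence of frames and predict a blame label for each frame. The sketch below shows that framing with a hand-written rule standing in for the learned multi-task sequence model; the module names, features, and labels are illustrative only.

```python
# Minimal sketch of framing blame-frame localization as sequence labeling:
# each stack frame is one token in the sequence and receives a blamed /
# not-blamed label. The rule below is only a stand-in for a trained model.
from dataclasses import dataclass

@dataclass
class Frame:
    module: str
    method: str

def featurize(frame: Frame) -> dict[str, bool]:
    return {
        "is_system_module": frame.module in {"ntdll", "kernelbase", "clr"},
        "is_exception_helper": "Throw" in frame.method,
    }

def tag(stack: list[Frame]) -> list[str]:
    """Label each frame; a trained model would predict these jointly."""
    labels = []
    for frame in stack:
        feats = featurize(frame)
        blamed = not feats["is_system_module"] and not feats["is_exception_helper"]
        labels.append("BLAMED" if blamed else "O")
    return labels

stack = [
    Frame("clr", "ThrowHelper"),
    Frame("myapp.renderer", "DrawFrame"),
    Frame("ntdll", "RtlUserThreadStart"),
]
print(list(zip([f.method for f in stack], tag(stack))))
```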


Mining Knowledge Graphs From Incident Reports

arXiv.org Artificial Intelligence

Incident management is a critical part of the DevOps processes for developing and operating large-scale services in the cloud. Incident reports filed by customers are largely unstructured, making any automated diagnosis or mitigation non-trivial. On-call engineers must parse verbose reports to understand the issue and locate key information. Prior work has looked into extraction of key attributes or entities like error codes, tenant IDs, and stack traces from incident and bug reports. Although a flat list of entities is informative, to unlock the full potential of knowledge extraction it is necessary to provide context for these entities. For instance, the relations between the real-world concepts or objects that these entities represent in otherwise unstructured data are useful for downstream tasks like incident linking, triaging, and mitigation. With this additional context, entities are transformed from "Strings" to "Things". In this work, we present an approach to mine and score binary entity relations from co-occurring entity pairs. We evaluate the extracted binary relations and show that our approach has a high precision of 0.9. Further, we construct knowledge graphs automatically and show that the implicit knowledge in the graph can be used to mine and rank relevant entities for distinct incidents, by mapping entities to clusters of incident titles.
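
A minimal version of the co-occurrence idea: count how often two extracted entities appear in the same incident and score candidate relations, for example with pointwise mutual information. The paper's scoring differs; PMI and the toy incident data below are stand-ins used only to illustrate the pipeline from co-occurrence counts to ranked knowledge-graph edges.

```python
# Illustrative sketch of mining binary entity relations from co-occurrence:
# count entity pairs that appear in the same incident and rank them with PMI.
# PMI and the toy data are stand-ins, not the paper's exact scoring.
import math
from collections import Counter
from itertools import combinations

# Each incident reduced to its extracted entity types.
incidents = [
    {"ErrorCode", "TenantId", "StackTrace"},
    {"ErrorCode", "TenantId"},
    {"ErrorCode", "StackTrace"},
    {"TenantId", "SubscriptionId"},
]

entity_counts = Counter(e for inc in incidents for e in inc)
pair_counts = Counter(frozenset(p) for inc in incidents
                      for p in combinations(sorted(inc), 2))
n = len(incidents)

def pmi(pair: frozenset) -> float:
    a, b = tuple(pair)
    p_ab = pair_counts[pair] / n
    return math.log2(p_ab / ((entity_counts[a] / n) * (entity_counts[b] / n)))

# Edges of the knowledge graph: entity pairs ranked by relation strength.
for pair in sorted(pair_counts, key=pmi, reverse=True):
    print(sorted(pair), round(pmi(pair), 2))
```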


Neural Knowledge Extraction From Cloud Service Incidents

arXiv.org Artificial Intelligence

In the last decade, two paradigm shifts have reshaped the software industry: the move from boxed products to services and the widespread adoption of cloud computing. This has had a huge impact on the software development life cycle and DevOps processes. In particular, incident management has become critical for developing and operating large-scale services. Incidents are created to ensure timely communication and resolution of service issues. Prior work on incident management has focused heavily on the challenges of incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame the knowledge extraction problem as a named-entity recognition (NER) task for extracting factual information. SoftNER leverages structural patterns like key-value pairs and tables to bootstrap the training data. Further, we build a novel multi-task learning-based BiLSTM-CRF model that leverages not just the semantic context but also the data types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider, and have evaluated it on more than two months of cloud incidents. We show that the unsupervised machine learning-based approach has a high precision of 0.96. Our multi-task learning-based deep learning model also outperforms state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build significantly more accurate models for important downstream tasks like incident triaging.
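
The bootstrapping step can be pictured as mining "key: value" structural patterns to produce weak labels, along with a coarse data type that a multi-task model could use as an auxiliary signal. The regex, type guesser, and example incident below are simplified illustrations, not SoftNER's actual rules or training pipeline.

```python
# Minimal sketch of bootstrapping weak NER labels from "key: value" structure
# in incident text; the data-type guess mimics the auxiliary signal a
# multi-task model could learn from. All patterns here are simplified.
import re

incident_text = """
Status Code: 500
Tenant Id: 3f2a-91bc
Region: eastus2
Customers report intermittent timeouts on checkout.
"""

KV_PATTERN = re.compile(r"^(?P<key>[A-Za-z][\w ]{1,30}):\s*(?P<value>\S.*)$",
                        re.MULTILINE)

def guess_dtype(value: str) -> str:
    """Coarse data type used as an auxiliary signal (cf. multi-task learning)."""
    if value.isdigit():
        return "numeric"
    if re.fullmatch(r"[0-9a-f\-]{6,}", value, re.IGNORECASE):
        return "identifier"
    return "text"

def bootstrap_labels(text: str) -> list[tuple[str, str, str]]:
    """Return (entity_type, value, data_type) weak labels mined from the text."""
    return [
        (m["key"].strip().lower().replace(" ", "_"),
         m["value"].strip(),
         guess_dtype(m["value"].strip()))
        for m in KV_PATTERN.finditer(text)
    ]

for label in bootstrap_labels(incident_text):
    print(label)
```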