Radhakrishna, Arjun
Ordered Semantically Diverse Sampling for Textual Data
Tiwari, Ashish, Singh, Mukul, Singha, Ananya, Radhakrishna, Arjun
The goal of diversity sampling is to select a representative subset of data that maximizes the information contained in the subset while keeping its cardinality small. We introduce the ordered diverse sampling problem, based on a new metric that measures the diversity of an ordered list of samples. We present a novel approach for generating ordered diverse samples of textual data that uses principal components of the embedding vectors. The proposed approach is simple, and we compare it against existing approaches using the new metric. We transform standard text classification benchmarks into benchmarks for ordered diverse sampling. Our empirical evaluation shows that prevailing approaches perform 6% to 61% worse than our method while also being less time efficient. Ablation studies show how the parts of the new approach contribute to the overall metrics.
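The abstract names principal components on embedding vectors but does not spell out the selection rule, so the following is a minimal sketch of one plausible instantiation in Python: project the embeddings onto principal components and greedily pick the not-yet-chosen sample that is most extreme along successive components. The cycling greedy scheme and the function name ordered_diverse_sample are assumptions, not the paper's method.

    # Sketch: ordered diverse sampling via principal components of embeddings.
    # The cycling greedy selection below is an illustrative assumption.
    import numpy as np
    from sklearn.decomposition import PCA

    def ordered_diverse_sample(embeddings: np.ndarray, k: int) -> list[int]:
        """Return indices of k samples, ordered for semantic diversity."""
        k = min(k, len(embeddings))
        n_comp = min(k, embeddings.shape[1], len(embeddings))
        projected = PCA(n_components=n_comp).fit_transform(embeddings)
        chosen: list[int] = []
        component = 0
        while len(chosen) < k:
            # Most extreme remaining sample along the current component.
            order = np.argsort(-np.abs(projected[:, component]))
            chosen.append(int(next(i for i in order if i not in chosen)))
            component = (component + 1) % projected.shape[1]
        return chosen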
TableTalk: Scaffolding Spreadsheet Development with a Language Agent
Liang, Jenny T., Kumar, Aayush, Bajpai, Yasharth, Gulwani, Sumit, Le, Vu, Parnin, Chris, Radhakrishna, Arjun, Tiwari, Ashish, Murphy-Hill, Emerson, Soares, Gustavo
Despite its ubiquity in the workforce, spreadsheet programming remains challenging, as programmers need both spreadsheet-specific knowledge (e.g., APIs to write formulas) and problem-solving skills to create complex spreadsheets. Large language models (LLMs) can help automate aspects of this process, and recent advances in planning and reasoning have enabled language agents, which dynamically plan, use tools, and take iterative actions to complete complex tasks. These agents observe, plan, and act, making them well-suited to scaffold spreadsheet programming by following expert processes. We present TableTalk, a language agent that helps programmers build spreadsheets conversationally. Its design reifies three principles -- scaffolding, flexibility, and incrementality -- which we derived from two studies of seven programmers and 62 Excel templates. TableTalk structures spreadsheet development by generating step-by-step plans and suggesting three next steps users can choose from. It also integrates tools that enable incremental spreadsheet construction. A user study with 20 programmers shows that TableTalk produces spreadsheets that are 2.3 times more likely to be preferred over those of a baseline agent, while reducing cognitive load and the time spent reasoning about spreadsheet actions by 12.6%. TableTalk's approach has implications for human-agent collaboration, including providing persistent direct-manipulation interfaces for stopping or undoing agent actions, while ensuring that such interfaces for accepting actions can be deactivated.
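As a purely hypothetical illustration of the scaffolding and incrementality principles, the sketch below shows one way an agent turn could draft a plan and surface exactly three candidate next steps for the user to choose from; the prompt wording and the STEP: convention are assumptions, not TableTalk's actual design.

    from typing import Callable

    LLM = Callable[[str], str]  # any text-completion function

    def tabletalk_turn(llm: LLM, sheet_state: str, request: str) -> list[str]:
        """One scaffolded turn: plan, then offer three next steps."""
        plan = llm(
            "You are helping build a spreadsheet incrementally.\n"
            f"Current sheet: {sheet_state}\n"
            f"User request: {request}\n"
            "Write a short step-by-step plan, then list exactly three "
            "candidate next steps, one per line, prefixed with 'STEP:'."
        )
        # Surface only the three suggested steps; the chosen one would be
        # executed by a spreadsheet-editing tool (not shown here).
        return [line.removeprefix("STEP:").strip()
                for line in plan.splitlines()
                if line.startswith("STEP:")][:3]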
STACKFEED: Structured Textual Actor-Critic Knowledge Base Editing with FeedBack
Gupta, Naman, Kirtania, Shashank, Gupta, Priyanshu, Kariya, Krishna, Gulwani, Sumit, Iyer, Arun, Parthasarathy, Suresh, Radhakrishna, Arjun, Rajamani, Sriram K., Soares, Gustavo
Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback, using a multi-actor, centralized-critic reinforcement learning framework. Each document is assigned to an actor, modeled as a ReAct agent, which performs structured edits based on document-specific targeted instructions from a centralized critic. Experimental results show that STACKFEED significantly improves KB quality and RAG system performance, enhancing accuracy by up to 8% over baselines.
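The following Python sketch captures the loop structure described above: a centralized critic turns expert feedback into one targeted instruction per document, and per-document actors apply structured edits. The prompts, the edit format, and the fixed round count are assumptions.

    from typing import Callable

    LLM = Callable[[str], str]  # any text-completion function

    def refine_kb(llm: LLM, documents: list[str], feedback: str,
                  rounds: int = 3) -> list[str]:
        """Iteratively refine KB documents from expert feedback."""
        for _ in range(rounds):
            # Centralized critic: one targeted instruction per document.
            instructions = [
                llm(f"Expert feedback on the knowledge base: {feedback}\n"
                    f"Document: {doc}\n"
                    "State one concrete edit instruction for this document.")
                for doc in documents
            ]
            # Actors: each applies its instruction as a structured edit.
            documents = [
                llm(f"Document: {doc}\n"
                    f"Apply this edit and return the revised document:\n{ins}")
                for doc, ins in zip(documents, instructions)
            ]
        return documents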
METAREFLECTION: Learning Instructions for Language Agents using Past Reflections
Gupta, Priyanshu, Kirtania, Shashank, Singha, Ananya, Gulwani, Sumit, Radhakrishna, Arjun, Shi, Sherry, Soares, Gustavo
Despite the popularity of Large Language Models (LLMs), crafting specific prompts for LLMs to perform particular tasks remains challenging. Users often engage in multiple conversational turns with an LLM-based agent to accomplish their intended task. Recent studies have demonstrated that linguistic feedback, in the form of self-reflections generated by the model, can act as reinforcement during these conversations, enabling quicker convergence to the desired outcome. Motivated by these findings, we introduce METAREFLECTION, a novel technique that learns general prompt instructions for a specific domain from individual self-reflections gathered during a training phase. We evaluate our technique in two domains: Infrastructure as Code (IAC) vulnerability detection and question-answering (QA) using REACT and COT. Our results demonstrate a notable improvement, with METAREFLECTION outperforming GPT-4 by 16.82% (IAC), 31.33% (COT), and 15.42% (REACT), underscoring the potential of METAREFLECTION as a viable method for enhancing the efficiency of LLMs.
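A minimal sketch of the training phase as the abstract describes it: gather self-reflections from failed attempts, then distill them into a reusable, domain-level instruction. The prompts and the solve/check callbacks are assumptions.

    from typing import Callable

    LLM = Callable[[str], str]  # any text-completion function

    def learn_instructions(llm: LLM, tasks: list[str],
                           solve: Callable[[LLM, str], str],
                           check: Callable[[str, str], bool]) -> str:
        reflections = []
        for task in tasks:
            attempt = solve(llm, task)
            if not check(task, attempt):
                # Linguistic feedback: a self-reflection on the failure.
                reflections.append(
                    llm(f"Task: {task}\nFailed attempt: {attempt}\n"
                        "Reflect in one sentence on what went wrong."))
        # Distill individual reflections into general domain instructions.
        return llm("Summarize these reflections into general prompt "
                   "instructions for this domain:\n" + "\n".join(reflections))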
Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants
Chopra, Bhavya, Bajpai, Yasharth, Biyani, Param, Soares, Gustavo, Radhakrishna, Arjun, Parnin, Chris, Gulwani, Sumit
The widespread availability of Large Language Models (LLMs) within Integrated Development Environments (IDEs) has led to their speedy adoption. Conversational interactions with LLMs enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the right questions and sustaining the conversation across multiple turns. In this paper, we draw inspiration from interaction patterns and conversation analysis to design Robin, an enhanced conversational AI-assistant for debugging. Through a within-subjects user study with 12 industry professionals, we find that equipping the LLM to (1) leverage the insert expansion interaction pattern, (2) facilitate turn-taking, and (3) utilize debugging workflows leads to lowered conversation barriers, effective fault localization, and a 5x improvement in bug resolution rates.
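Of the three mechanisms, the insert expansion pattern is the most directly mechanizable: before answering, the assistant may insert a clarifying sub-exchange when context is missing. The sketch below is a hypothetical rendering of that idea, not Robin's implementation; the CLARIFY: convention is an assumption.

    from typing import Callable

    LLM = Callable[[str], str]  # any text-completion function

    def answer_with_insert_expansion(llm: LLM, question: str,
                                     context: str) -> str:
        probe = llm(
            f"Context: {context}\nQuestion: {question}\n"
            "If the context is missing information needed to answer, ask "
            "one clarifying question prefixed with 'CLARIFY:'. Otherwise "
            "reply 'OK'.")
        if probe.startswith("CLARIFY:"):
            # Insert expansion: hand the turn back to the developer first.
            return probe
        return llm(f"Context: {context}\nAnswer the question: {question}")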
GrACE: Generation using Associated Code Edits
Gupta, Priyanshu, Khare, Avishree, Bajpai, Yasharth, Chakraborty, Saikat, Gulwani, Sumit, Kanade, Aditya, Radhakrishna, Arjun, Soares, Gustavo, Tiwari, Ashish
Developers expend a significant amount of time editing code for a variety of reasons, such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research, due to the diversity of code edits and the difficulty of capturing the developer's intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) of code with the knowledge of prior, relevant edits. The generative capability of the LLMs helps address the diversity in code changes, and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, the knowledge of prior edits boosts the performance of the LLMs significantly and enables them to generate 29% and 54% more correctly edited code in top-1 suggestions, relative to the current state-of-the-art symbolic and neural approaches respectively.
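The central mechanism, conditioning generation on prior associated edits, can be sketched as simple prompt construction; the template and diff rendering below are assumptions rather than the paper's exact format.

    from typing import Callable

    LLM = Callable[[str], str]  # any text-completion function

    def predict_edit(llm: LLM, target_code: str,
                     prior_edits: list[str]) -> str:
        """Predict the next edit, conditioned on prior relevant edits."""
        prompt = "Recent related edits:\n"
        for i, diff in enumerate(prior_edits, 1):
            prompt += f"--- edit {i} ---\n{diff}\n"
        prompt += ("Given these prior edits, predict the next edit to the "
                   "code below and return the revised code.\n" + target_code)
        return llm(prompt)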
Overwatch: Learning Patterns in Code Edit Sequences
Zhang, Yuhao, Bajpai, Yasharth, Gupta, Priyanshu, Ketkar, Ameya, Allamanis, Miltiadis, Barik, Titus, Gulwani, Sumit, Radhakrishna, Arjun, Raza, Mohammad, Soares, Gustavo, Tiwari, Ashish
Integrated Development Environments (IDEs) provide tool support to automate many source code editing tasks. Traditionally, IDEs use only the spatial context, i.e., the location where the developer is editing, to generate candidate edit recommendations. However, spatial context alone is often not sufficient to confidently predict the developer's next edit, so IDEs generate many suggestions at a location. Therefore, IDEs generally do not actively offer suggestions; instead, the developer is usually required to click on a specific icon or menu and then select from a large list of potential suggestions. As a consequence, developers often miss the opportunity to use the tool support because they are not aware it exists or forget to use it. To better understand common patterns in developer behavior and produce better edit recommendations, we can additionally use the temporal context, i.e., the edits that a developer was recently performing. To enable edit recommendations based on temporal context, we present Overwatch, a novel technique for learning edit sequence patterns from traces of developers' edits performed in an IDE. Our experiments show that Overwatch achieves 78% precision and that it not only completes edits when developers miss the opportunity to use the IDE's tool support, but also predicts new edits that have no tool support in the IDE.
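As a toy illustration of learning from temporal context, the sketch below mines the most frequent successor of each edit kind from traces and uses it to suggest the next edit. Overwatch's actual patterns are far richer than these bigram counts; this is illustrative only.

    from collections import Counter, defaultdict

    def learn_patterns(traces: list[list[str]]) -> dict[str, str]:
        """Map each edit kind to its most frequent successor in the traces."""
        follow_ups: dict[str, Counter] = defaultdict(Counter)
        for trace in traces:
            for prev_edit, next_edit in zip(trace, trace[1:]):
                follow_ups[prev_edit][next_edit] += 1
        return {edit: counts.most_common(1)[0][0]
                for edit, counts in follow_ups.items()}

    def suggest(patterns: dict[str, str], last_edit: str) -> str | None:
        # Temporal context: recommend the likely follow-up to the last edit.
        return patterns.get(last_edit)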
Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis
Rahmani, Kia, Raza, Mohammad, Gulwani, Sumit, Le, Vu, Morris, Daniel, Radhakrishna, Arjun, Soares, Gustavo, Tiwari, Ashish
Multi-modal program synthesis refers to the task of synthesizing programs (code) from specifications given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle to generate syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications such as examples, but they are unable to work with the ambiguity of natural language. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidate programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise examples-based specification. We use our combination approach to instantiate multi-modal synthesis systems for two programming domains: regular expressions and CSS selectors. Our evaluation demonstrates the effectiveness of our domain-agnostic approach in comparison to a state-of-the-art specialized system, and the generality of our approach in providing multi-modal program synthesis from natural language and examples across different programming domains.
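One way to realize the combination, sketched under heavy assumptions: try the PTM's natural-language-derived candidates directly against the examples, and if none fits, run a component-based enumeration seeded with components mined from those candidates. The component extraction and enumeration below are crude stand-ins for the real CBS procedure.

    from typing import Callable, Iterator

    def synthesize(nl_candidates: list[str],
                   run: Callable[[str, str], str],
                   examples: list[tuple[str, str]]) -> str | None:
        def fits(program: str) -> bool:
            return all(run(program, i) == o for i, o in examples)
        # Try the PTM's candidates directly against the precise examples.
        for program in nl_candidates:
            if fits(program):
                return program
        # Guide CBS with components mined from the candidates (tokenization
        # stands in for real component extraction).
        components = sorted({tok for p in nl_candidates for tok in p.split()})
        for program in enumerate_programs(components):
            if fits(program):
                return program
        return None

    def enumerate_programs(components: list[str]) -> Iterator[str]:
        # Placeholder enumeration: single components, then concatenations.
        yield from components
        for a in components:
            for b in components:
                yield a + b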
Information-theoretic User Interaction: Significant Inputs for Program Synthesis
Tiwari, Ashish, Radhakrishna, Arjun, Gulwani, Sumit, Perelman, Daniel
Programming-by-example technologies are being deployed in industrial products for real-time synthesis of various kinds of data transformations. These technologies rely on the user to provide a few representative examples of the transformation task. Motivated by the need to find the most pertinent question to ask the user, in this paper, we introduce the significant questions problem and show that it is hard in general. We then develop an information-theoretic greedy approach for solving the problem. We justify the greedy algorithm using a conditional entropy result, which informally says that the question that achieves the maximum information gain is the one whose answer we know least about. In the context of interactive program synthesis, we use this result to develop an active program learner that generates significant inputs to pose as queries to the user in each iteration. The procedure requires extending a passive program learner to a sampling program learner that can sample candidate programs from the set of all consistent programs, enabling estimation of information gain. It also clusters inputs, based on features of the inputs and the corresponding outputs, to sample a small set of candidate significant inputs. Our active learner is able to trade off false negatives for false positives and converges in a small number of iterations on a real-world dataset of around 800 string transformation tasks.
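The greedy step can be made concrete: sample programs consistent with the examples so far, and ask about the input whose outputs across the sampled programs have maximum entropy, i.e., the input whose answer we know least about. The sketch below stubs program sampling and execution as parameters.

    import math
    from collections import Counter
    from typing import Callable

    def most_significant_input(programs: list,
                               inputs: list[str],
                               run: Callable) -> str:
        """Pick the input with maximum output entropy over sampled programs."""
        def entropy(values: list[str]) -> float:
            counts = Counter(values)
            total = len(values)
            return -sum((c / total) * math.log2(c / total)
                        for c in counts.values())
        # Maximum expected information gain = most uncertain output.
        return max(inputs,
                   key=lambda inp: entropy([run(p, inp) for p in programs]))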
Quantitative Programming by Examples
Gulwani, Sumit, Pathak, Kunal, Radhakrishna, Arjun, Tiwari, Ashish, Udupa, Abhishek
Programming-by-Example (PBE) systems synthesize an intended program in some (relatively constrained) domain-specific language from a small number of input-output examples provided by the user. In this paper, we motivate and define the problem of quantitative PBE (qPBE), which concerns synthesizing an intended program over an underlying (real-world) programming language that also minimizes a given quantitative cost function. We present a modular approach for solving qPBE that consists of three phases: intent disambiguation, global search, and local search. On two concrete objectives, namely program performance and size, our qPBE procedure achieves 1.53x and 1.26x improvement respectively over the baseline FlashFill PBE system, averaged over 701 benchmarks. Our detailed experiments validate the design of our procedure and show the value of combining global and local search for qPBE.
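The three phases can be sketched as a simple pipeline: filter candidates by the examples (intent disambiguation), pick the cheapest consistent program (global search), then hill-climb with cost-reducing rewrites (local search). All helpers below are stand-ins for components the abstract leaves unspecified.

    from typing import Callable

    def qpbe(candidates: list[str],
             examples: list[tuple[str, str]],
             run: Callable[[str, str], str],
             cost: Callable[[str], float],
             rewrites: Callable[[str], list[str]]) -> str | None:
        def fits(p: str) -> bool:
            return all(run(p, i) == o for i, o in examples)
        # Phase 1: intent disambiguation via the examples.
        consistent = [p for p in candidates if fits(p)]
        if not consistent:
            return None
        # Phase 2: global search for the cheapest consistent program.
        best = min(consistent, key=cost)
        # Phase 3: local search with cost-reducing rewrites.
        improved = True
        while improved:
            improved = False
            for alt in rewrites(best):
                if cost(alt) < cost(best) and fits(alt):
                    best, improved = alt, True
                    break
        return best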