Chan, Aaron
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Agarwal, Anisha, Chan, Aaron, Chandel, Shubham, Jang, Jinu, Miller, Shaun, Moghaddam, Roshanak Zilouchian, Mohylevskyy, Yevhen, Sundaresan, Neel, Tufano, Michele
The integration of Large Language Models (LLMs) into Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state of the art evaluation systems. We design and compute both static and execution based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM guided IDEs.
Tailoring Self-Rationalizers with Multi-Reward Distillation
Ramnath, Sahana, Joshi, Brihi, Hallinan, Skyler, Lu, Ximing, Li, Liunian Harold, Chan, Aaron, Hessel, Jack, Choi, Yejin, Ren, Xiang
Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (approx. 200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on five difficult question-answering datasets StrategyQA, QuaRel, OpenBookQA, NumerSense and QASC show that not only does MaRio improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline. Extensive human evaluations confirm that MaRio rationales are preferred vs. SFT rationales, as well as qualitative improvements in plausibility and consistency.
Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models
Jiang, Song, Shakeri, Zahra, Chan, Aaron, Sanjabi, Maziar, Firooz, Hamed, Xia, Yinglong, Akyildiz, Bugra, Sun, Yizhou, Li, Jinchao, Wang, Qifan, Celikyilmaz, Asli
Chain-of-thought (CoT) prompting, which offers step-by-step problem-solving rationales, has impressively unlocked the reasoning potential of large language models (LLMs). Yet, the standard CoT is less effective in problems demanding multiple reasoning steps. This limitation arises from the complex reasoning process in multi-step problems: later stages often depend on the results of several steps earlier, not just the results of the immediately preceding step. Such complexities suggest the reasoning process is naturally represented as a graph. The almost linear and straightforward structure of CoT prompting, however, struggles to capture this complex reasoning graph. To address this challenge, we propose Residual Connection Prompting (RESPROMPT), a new prompting strategy that advances multi-step reasoning in LLMs. Our key idea is to reconstruct the reasoning graph within prompts. We achieve this by integrating necessary connections-links present in the reasoning graph but missing in the linear CoT flow-into the prompts. Termed "residual connections", these links are pivotal in morphing the linear CoT structure into a graph representation, effectively capturing the complex reasoning graphs inherent in multi-step problems. We evaluate RESPROMPT on six benchmarks across three diverse domains: math, sequential, and commonsense reasoning. For the open-sourced LLaMA family of models, RESPROMPT yields a significant average reasoning accuracy improvement of 12.5% on LLaMA-65B and 6.8% on LLaMA2-70B. Breakdown analysis further highlights RESPROMPT particularly excels in complex multi-step reasoning: for questions demanding at least five reasoning steps, RESPROMPT outperforms the best CoT based benchmarks by a remarkable average improvement of 21.1% on LLaMA-65B and 14.3% on LLaMA2-70B. Through extensive ablation studies and analyses, we pinpoint how to most effectively build residual connections.
Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?
Chan, Aaron, Kharkar, Anant, Moghaddam, Roshanak Zilouchian, Mohylevskyy, Yevhen, Helyar, Alec, Kamal, Eslam, Elkamhawy, Mohamed, Sundaresan, Neel
Software vulnerabilities bear enterprises significant costs. Despite extensive efforts in research and development of software vulnerability detection methods, uncaught vulnerabilities continue to put software owners and users at risk. Many current vulnerability detection methods require that code snippets can compile and build before attempting detection. This, unfortunately, introduces a long latency between the time a vulnerability is injected to the time it is removed, which can substantially increases the cost of fixing a vulnerability. We recognize that the current advances in machine learning can be used to detect vulnerable code patterns on syntactically incomplete code snippets as the developer is writing the code at EditTime. In this paper we present a practical system that leverages deep learning on a large-scale data set of vulnerable code patterns to learn complex manifestations of more than 250 vulnerability types and detect vulnerable code patterns at EditTime. We discuss zero-shot, few-shot, and fine-tuning approaches on state of the art pre-trained Large Language Models (LLMs). We show that in comparison with state of the art vulnerability detection models our approach improves the state of the art by 10%. We also evaluate our approach to detect vulnerability in auto-generated code by code LLMs. Evaluation on a benchmark of high-risk code scenarios shows a reduction of up to 90% vulnerability reduction.
KNIFE: Distilling Reasoning Knowledge From Free-Text Rationales
Chan, Aaron, Zeng, Zhiyuan, Lake, Wyatt, Joshi, Brihi, Chen, Hanjie, Ren, Xiang
Language models (LMs) have yielded impressive results on many language reasoning tasks, but their unexpected errors raise doubts about their reasoning abilities. In light of this, there is growing interest in finetuning/prompting LMs with both task instances and their associated free-text rationales (FTRs), which explain the correct reasoning process for predicting the correct task output (i.e., how to be "right for the right reasons"). However, existing finetuning methods fail to improve LM performance, while prompting needs prohibitively large (i.e., >50B) LMs to work well. We propose KNIFE, which shows that reasoning knowledge can be effectively distilled from FTRs into a small (i.e., <1B) LM and improve the LM's performance. First, KNIFE finetunes a teacher LM (given task input and FTR) to predict the task output, transferring reasoning knowledge from the FTRs to the teacher's hidden states. Second, KNIFE finetunes a student LM (given task input only) such that its hidden states are aligned with the teacher's. Thus, the student is endowed with reasoning knowledge but can be used for inference without direct FTR input. On two question-answering datasets, KNIFE outperforms various finetuning and prompting baselines in fully-supervised and low-resource settings. Also, we observe that FTR quality is crucial to KNIFE's performance.
Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales
Joshi, Brihi, Liu, Ziyi, Ramnath, Sahana, Chan, Aaron, Tong, Zhewei, Nie, Shaoliang, Wang, Qifan, Choi, Yejin, Ren, Xiang
Among the remarkable emergent capabilities of large language models (LMs) is free-text rationalization; beyond a certain scale, large LMs are capable of generating seemingly useful rationalizations, which in turn, can dramatically enhance their performances on leaderboards. This phenomenon raises a question: can machine generated rationales also be useful for humans, especially when lay humans try to answer questions based on those machine rationales? We observe that human utility of existing rationales is far from satisfactory, and expensive to estimate with human studies. Existing metrics like task performance of the LM generating the rationales, or similarity between generated and gold rationales are not good indicators of their human utility. While we observe that certain properties of rationales like conciseness and novelty are correlated with their human utility, estimating them without human involvement is challenging. We show that, by estimating a rationale's helpfulness in answering similar unseen instances, we can measure its human utility to a better extent. We also translate this finding into an automated score, GEN-U, that we propose, which can help improve LMs' ability to generate rationales with better human utility, while maintaining most of its task performance. Lastly, we release all code and collected data with this project.
PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales
Wang, Peifeng, Chan, Aaron, Ilievski, Filip, Chen, Muhao, Ren, Xiang
Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training or prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation and/or computation, without any assurance that their generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more faithful to its task predictions than those generated by competitive baselines. Many language-based reasoning tasks require retrieving and reasoning over knowledge beyond the task input--e.g., commonsense reasoning and closed-book QA (Figure 1, left) (Talmor et al., 2018; Mihaylov et al., 2018). Neural language models (LMs) have achieved impressive results on such tasks by utilizing latent knowledge encoded in their pretrained parameters (Raffel et al., 2020b; Brown et al., 2020). Still, given LMs' black-box nature, it is unclear whether this knowledge is being used properly (Doshi-Velez & Kim, 2017; Lipton, 2018). Previous studies have shown that LMs often learn spurious correlations from artifacts in downstream training data, thus limiting their generalizability (Branco et al., 2021; Geirhos et al., 2020; D'Amour et al., 2020). With this in mind, a number of prior works aim to make LMs' reasoning processes more explicit by generating free-text rationales, which use LMs' internal knowledge to describe a reasoning process in natural language (Narang et al., 2020; Wei et al., 2022b; Marasoviฤ et al., 2022; Zelikman et al., 2022). In the fine-tuned self-rationalizing paradigm, a single LM is fine-tuned to jointly generate the task output and rationale (Narang et al., 2020; Marasoviฤ et al., 2022; Zelikman et al., 2022). In the prompted self-rationalizing paradigm, a single LM is instead frozen and prompted to jointly generate the task output and rationale, with the prompt consisting of a few input-output-rationale demonstrations (Wei et al., 2022b).
ER-Test: Evaluating Explanation Regularization Methods for Language Models
Joshi, Brihi, Chan, Aaron, Liu, Ziyi, Nie, Shaoliang, Sanjabi, Maziar, Firooz, Hamed, Ren, Xiang
By explaining how humans would solve a given task, human rationales can provide strong learning signal for neural language models (LMs). Explanation regularization (ER) aims to improve LM generalization by pushing the LM's machine rationales (Which input tokens did the LM focus on?) to align with human rationales (Which input tokens would humans focus on?). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER's effect on OOD generalization has been underexplored. In this paper, we introduce ER-Test, a framework for evaluating ER models' OOD generalization along three dimensions: unseen dataset tests, contrast set tests, and functional tests. Using ER-Test, we extensively analyze how ER models' OOD generalization varies with different ER design choices. Across two tasks and six datasets, ER-Test shows that ER has little impact on ID performance but can yield large OOD performance gains. Also, we find that ER can improve OOD performance even with limited rationale supervision. ER-Test's results help demonstrate ER's utility and establish best practices for using ER effectively.
UNIREX: A Unified Learning Framework for Language Model Rationale Extraction
Chan, Aaron, Sanjabi, Maziar, Mathias, Lambert, Tan, Liang, Nie, Shaoliang, Peng, Xiaochang, Ren, Xiang, Firooz, Hamed
An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the prediction. Ideally, rationale extraction should be faithful (reflective of LM's actual behavior) and plausible (convincing to humans), without compromising the LM's (i.e., task model's) task performance. Although attribution algorithms and select-predict pipelines are commonly used in rationale extraction, they both rely on certain heuristics that hinder them from satisfying all three desiderata. In light of this, we propose UNIREX, a flexible learning framework that generalizes rationale extractor optimization as follows: (1) specify architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly the train task model and rationale extractor on the task using the selected objectives. UNIREX enables replacing prior works' heuristic design choices with a generic learned rationale extractor in (1) and optimizing it for all three desiderata in (2)-(3). To facilitate comparison between methods with respect to multiple desiderata, we introduce the Normalized Relative Gain (NRG) metric. Across five text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, we find that UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.
FRAME: Evaluating Rationale-Label Consistency Metrics for Free-Text Rationales
Chan, Aaron, Nie, Shaoliang, Tan, Liang, Peng, Xiaochang, Firooz, Hamed, Sanjabi, Maziar, Ren, Xiang
Following how humans communicate, free-text rationales aim to use natural language to explain neural language model (LM) behavior. However, free-text rationales' unconstrained nature makes them prone to hallucination, so it is important to have metrics for free-text rationale quality. Existing free-text rationale metrics measure how consistent the rationale is with the LM's predicted label, but there is no protocol for assessing such metrics' reliability. Thus, we propose FRAME, a framework for evaluating rationale-label consistency (RLC) metrics for free-text rationales. FRAME is based on three axioms: (1) good metrics should yield highest scores for reference rationales, which maximize RLC by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing RLC metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining which muddles the metric's signal. Then, we introduce a non-pretraining RLC metric that greatly outperforms baselines on (1) and (3), while performing competitively on (2). Finally, we discuss the limitations of using RLC to evaluate free-text rationales.