Balashankar, Ananth
InfAlign: Inference-aware language model alignment
Balashankar, Ananth, Sun, Ziteng, Berant, Jonathan, Eisenstein, Jacob, Collins, Michael, Hutter, Adrian, Lee, Jong, Nagpal, Chirag, Prost, Flavien, Sinha, Aradhana, Suresh, Ananda Theertha, Beirami, Ahmad
Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we increasingly decode from language models with inference-time algorithms (e.g., best-of-N, controlled decoding, tree search) rather than standard sampling, yet the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates the KL-regularized calibrate-and-transform RL (CTRL) algorithm, which involves a reward calibration step followed by a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest (or, for jailbreaking, the lowest) reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% in inference-time win rate on the Anthropic helpfulness and harmlessness dialog benchmark datasets.
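As a rough illustration of the calibrate-and-transform idea described above, the sketch below calibrates a raw reward to its quantile under the reference policy and then applies a simple monotone transformation before standard KL-regularized reward maximization. The quantile estimate from sampled reference responses and the specific transform q^(N-1) are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative sketch (not the exact CTRL recipe): calibrate a raw reward to its
# quantile under the reference policy, then transform it before KL-regularized RL.
import numpy as np

def calibrate_reward(raw_reward_fn, reference_samples_per_prompt):
    """Return a calibrated reward: the quantile of r(prompt, response) among
    reference-policy samples for that prompt."""
    def calibrated(prompt, response):
        ref_scores = np.array([raw_reward_fn(prompt, y)
                               for y in reference_samples_per_prompt[prompt]])
        return float((ref_scores <= raw_reward_fn(prompt, response)).mean())
    return calibrated

def bon_transform(q, n=4):
    # Hypothetical monotone transform that upweights the upper tail, mirroring the
    # tilt induced by best-of-N sampling (the max of N uniforms has density N * q**(N-1)).
    return q ** (n - 1)

# The transformed reward bon_transform(calibrated(prompt, response)) would then be
# optimized with an off-the-shelf KL-regularized RLHF method (e.g., PPO with a KL penalty).
```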
Comprehensive Monitoring of Air Pollution Hotspots Using Sparse Sensor Networks
Bhardwaj, Ankit, Balashankar, Ananth, Iyer, Shiva, Soans, Nita, Sudarshan, Anant, Pande, Rohini, Subramanian, Lakshminarayanan
Urban air pollution hotspots pose significant health risks, yet their detection and analysis remain limited by the sparsity of public sensor networks. This paper addresses this challenge by combining predictive modeling and mechanistic approaches to comprehensively monitor pollution hotspots. We enhanced New Delhi's existing sensor network with 28 low-cost sensors, collecting PM2.5 data over 30 months, from May 1, 2018, to Nov 1, 2020. Applying established definitions of hotspots to these data, we found 189 additional hidden hotspots, beyond confirming the 660 hotspots detected by the public network. Using predictive techniques such as Space-Time Kriging, we identified hidden hotspots with 95% precision and 88% recall under a 50% sensor failure rate, and with 98% precision and 95% recall when 50% of sensors were missing. The projections from our predictive models were further compiled into policy recommendations for public authorities. Additionally, we developed a Gaussian Plume Dispersion Model to understand the mechanistic underpinnings of hotspot formation, incorporating an emissions inventory derived from local sources. Our mechanistic model explains 65% of the observed transient hotspots. Our findings underscore the importance of integrating data-driven predictive models with physics-based mechanistic models for scalable and robust air pollution management in resource-constrained settings.
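For reference, the mechanistic component builds on the standard Gaussian plume equation; the sketch below is the textbook form of that equation, not the paper's calibrated model, and the dispersion coefficients are assumed to be supplied by the caller at the relevant downwind distance.

```python
# Minimal Gaussian plume sketch (standard textbook form, not the paper's calibrated model).
import numpy as np

def gaussian_plume(q, u, y, z, stack_height, sigma_y, sigma_z):
    """
    Steady-state concentration downwind of a point source.
    q: emission rate (g/s), u: wind speed (m/s), y: crosswind offset (m), z: receptor height (m),
    sigma_y, sigma_z: lateral/vertical dispersion coefficients (m) evaluated at the
    downwind distance of interest. Returns concentration in g/m^3, including the
    usual ground-reflection term.
    """
    lateral = np.exp(-y**2 / (2 * sigma_y**2))
    vertical = (np.exp(-(z - stack_height)**2 / (2 * sigma_z**2))
                + np.exp(-(z + stack_height)**2 / (2 * sigma_z**2)))
    return q / (2 * np.pi * u * sigma_y * sigma_z) * lateral * vertical
```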
Inducing Group Fairness in LLM-Based Decisions
Atwood, James, Lahoti, Preethi, Balashankar, Ananth, Prost, Flavien, Beirami, Ahmad
Prompting Large Language Models (LLMs) has created new and interesting means for classifying textual data. While evaluating and remediating group fairness is a well-studied problem in the classifier fairness literature, some classical approaches (e.g., regularization) do not carry over, and some new opportunities arise (e.g., prompt-based remediation). We measure the fairness of LLM-based classifiers on a toxicity classification task and empirically show that prompt-based classifiers may lead to unfair decisions. We introduce several remediation techniques and benchmark their fairness and performance trade-offs. We hope our work encourages more research on group fairness in LLM-based classifiers.
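As an illustration of one way such a fairness measurement could be computed, the sketch below reports the gap in false positive rates across groups for a set of classifier predictions; the choice of metric is an assumption, not necessarily the one used in the paper.

```python
# Illustrative group-fairness measurement for an LLM prompt-based classifier:
# the maximum gap in false positive rates across groups (one common choice of metric).
from collections import defaultdict

def false_positive_rate_gap(predictions, labels, groups):
    """predictions/labels: 0/1 sequences; groups: group id per example."""
    fp, neg = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        if label == 0:                     # only negatives can yield false positives
            neg[group] += 1
            fp[group] += int(pred == 1)
    rates = {g: fp[g] / neg[g] for g in neg if neg[g] > 0}
    return max(rates.values()) - min(rates.values()), rates
```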
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Wu, Zhaofeng, Balashankar, Ananth, Kim, Yoon, Eisenstein, Jacob, Beirami, Ahmad
Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on more than 70% of evaluation instances in some settings. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices for the case where no language-specific data are available even for supervised finetuning, another component of alignment.
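A minimal sketch of the zero-shot transfer recipe, assuming a best-of-n reranking use of the reward model: `source_lang_reward_model` is a placeholder for a reward model trained only on source-language preference data and applied directly to target-language candidates.

```python
# Illustrative best-of-n reranking with a cross-lingually transferred reward model.
def rerank_with_transferred_rm(source_lang_reward_model, prompt, target_lang_candidates):
    """Score target-language candidates with a source-language reward model
    (the zero-shot transfer step) and return the highest-scoring one."""
    scored = [(source_lang_reward_model(prompt, c), c) for c in target_lang_candidates]
    return max(scored)[1]
```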
Effective Robustness against Natural Distribution Shifts for Models with Different Training Data
Shi, Zhouxing, Carlini, Nicholas, Balashankar, Ananth, Schmidt, Ludwig, Hsieh, Cho-Jui, Beutel, Alex, Qin, Yao
"Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate the ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data. To do this, we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of effective robustness when there are models with different training data. It may also explain the surprising effective robustness gains of zero-shot CLIP-like models exhibited in prior works that used ImageNet as the only ID test set, while the gains diminish under our new evaluation.
Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning
Balashankar, Ananth, Ma, Xiao, Sinha, Aradhana, Beirami, Ahmad, Qin, Yao, Chen, Jilin, Beutel, Alex
As large language models (LLMs) are widely adopted, new safety issues and policies emerge to which existing safety classifiers do not generalize well. If we have observed only a few examples of violations of a new safety rule, how can we build a classifier to detect violations? In this paper, we study the novel setting of domain-generalized few-shot learning for LLM-based text safety classifiers. Unlike prior few-shot work, these new safety issues can be hard to uncover, and we do not get to choose the few examples. We demonstrate that existing few-shot techniques do not perform well in this setting; instead, we propose parameter-efficient fine-tuning (PEFT) combined with augmenting the training data with similar examples from existing rules. We empirically show that our approach of similarity-based data augmentation + prompt-tuning (DAPT) consistently outperforms baselines that forgo either data augmentation or PEFT, by 7-17% F1 score on the Social Chemistry moral judgement task and 9-13% AUC on the Toxicity detection task, even when the new rule is only loosely correlated with existing ones.
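A minimal sketch of the data-augmentation half of DAPT, assuming cosine similarity over sentence embeddings: `embed`, `k`, and the retrieval rule are placeholders, and the returned examples would then be used for prompt-tuning.

```python
# Illustrative similarity-based augmentation: retrieve nearest neighbors of the few
# new-rule examples from data collected for existing rules.
import numpy as np

def augment_with_neighbors(new_rule_examples, existing_rule_examples, embed, k=5):
    """Return the few new-rule examples plus their k nearest neighbors (by cosine
    similarity of embeddings) drawn from existing-rule data."""
    new_vecs = np.array([embed(x) for x in new_rule_examples])
    old_vecs = np.array([embed(x) for x in existing_rule_examples])
    new_vecs = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    old_vecs = old_vecs / np.linalg.norm(old_vecs, axis=1, keepdims=True)
    sims = new_vecs @ old_vecs.T                      # cosine similarity matrix
    neighbor_ids = np.argsort(-sims, axis=1)[:, :k]   # top-k neighbors per new example
    neighbors = {existing_rule_examples[j] for row in neighbor_ids for j in row}
    return list(new_rule_examples) + sorted(neighbors)
```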
Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Sinha, Aradhana, Balashankar, Ananth, Beirami, Ahmad, Avrahami, Thi, Chen, Jilin, Beutel, Alex
Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations, such as word substitution, does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets, both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, additionally training on our synthetic adversarial examples improves model robustness to future rounds of attacks. On ANLI, we see accuracy gains on the current set of attacks (44.1% → 50.1%) and on two future, unseen rounds of human-generated attacks (32.5% → 43.4% and 29.4% → 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 → 0.84) and on a future round (0.77 → 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.
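A schematic version of the augmentation loop, assuming a generator fine-tuned to imitate observed human attacks and a filter that keeps candidates which fool the current model; `attack_generator`, `victim`, and their methods are placeholders rather than the authors' implementation.

```python
# Schematic "imitate human attackers, then filter" augmentation loop.
def generate_human_like_attacks(attack_generator, victim, human_pairs, labeled_originals, n=8):
    """human_pairs: (original_text, human_attack_text) pairs used to imitate human attackers;
    labeled_originals: (original_text, gold_label) examples to perturb."""
    generator = attack_generator.fine_tune(human_pairs)          # learn the human attack distribution
    synthetic = []
    for original, gold_label in labeled_originals:
        for candidate in generator.sample(original, n=n):
            if victim.predict(candidate) != gold_label:           # keep candidates that fool the model
                synthetic.append((candidate, gold_label))         # label assumed unchanged by the attack
    return synthetic  # appended to training data for the next round of adversarial training
```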
Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals
Balashankar, Ananth, Wang, Xuezhi, Qin, Yao, Packer, Ben, Thain, Nithum, Chen, Jilin, Chi, Ed H., Beutel, Alex
Counterfactual Data Augmentation (CDA) is a commonly used technique for improving robustness in natural language classifiers. However, one fundamental challenge is how to discover meaningful counterfactuals and efficiently label them, with minimal human labeling cost. Most existing methods either rely completely on human-annotated labels, an expensive process that limits the scale of counterfactual data, or implicitly assume label invariance, which may mislead the model with incorrect labels. In this paper, we present a novel framework that uses counterfactual generative models to generate a large number of diverse counterfactuals by actively sampling from regions of uncertainty, and then automatically labels them with a learned pairwise classifier. Our key insight is that we can label the generated counterfactuals more accurately by training a pairwise classifier that interpolates the relationship between the original example and the counterfactual. We demonstrate that with a small amount of human-annotated counterfactual data (10%), we can generate a counterfactual augmentation dataset with learned labels that provides an 18-20% improvement in robustness and a 14-21% reduction in errors on 6 out-of-domain datasets, comparable to a fully human-annotated counterfactual dataset, for both sentiment classification and question paraphrase tasks.
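A rough sketch of the active generation and pairwise labeling steps, assuming entropy-based uncertainty for active sampling; `generator`, `task_model`, and `pair_model` are placeholders for the paper's components.

```python
# Illustrative active selection of counterfactuals plus pairwise labeling.
import numpy as np

def entropy(probs):
    probs = np.clip(np.asarray(probs), 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())

def label_counterfactuals(generator, task_model, pair_model, labeled_examples, budget=1000):
    candidates = []
    for text, label in labeled_examples:
        for cf in generator.counterfactuals(text):
            uncertainty = entropy(task_model.predict_proba(cf))   # sample where the model is unsure
            candidates.append((uncertainty, text, label, cf))
    candidates.sort(reverse=True)                                 # most uncertain first
    augmented = []
    for _, text, label, cf in candidates[:budget]:
        cf_label = pair_model.predict(original=text, counterfactual=cf, original_label=label)
        augmented.append((cf, cf_label))                          # learned label, not assumed invariant
    return augmented
```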
Fairness Sample Complexity and the Case for Human Intervention
Balashankar, Ananth, Lees, Alyssa
With the aim of building machine learning systems that incorporate standards of fairness and accountability, we explore explicit subgroup sample complexity bounds. The work is motivated by the observation that classifier predictions for real-world datasets often exhibit drastically different metrics, such as accuracy, when subdivided by specific sensitive-variable subgroups. The reasons for these discrepancies are varied and include, but are not limited to, the influence of mitigating variables, institutional bias, underlying population distributions, and sampling bias. Among the numerous definitions of fairness that exist, we argue that, at a minimum, principled ML practice should ensure that classification predictions mirror the underlying sub-population distributions. However, as the number of sensitive variables increases, populations at the intersection of these variables may simply not exist or may not be large enough to provide accurate samples for classification. In these increasingly likely scenarios, we make the case for human intervention and for applying situational and individual definitions of fairness. In this paper, we present lower bounds on subgroup sample complexity for metric-fair learning, based on the theory of Probably Approximately Metric-Fair Learning. We demonstrate that for a classifier to approach a definition of fairness in terms of specific sensitive variables, adequate subgroup population samples need to exist and the model dimensionality has to be aligned with the subgroup population distributions. In cases where this is not feasible, we propose an approach that uses individual fairness definitions to achieve alignment. We examine two commonly explored UCI datasets under this lens and suggest human interventions in data collection for specific subgroups to achieve approximate individual fairness for linear hypotheses.
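To make the core argument concrete, the schematic bound below is the standard uniform-convergence form of a per-subgroup sample requirement; it is shown only as an illustration of why small intersectional subgroups defeat purely statistical guarantees, and it is not the paper's metric-fairness bound.

```latex
% Schematic per-subgroup sample-complexity requirement (standard uniform-convergence
% form; illustrative only, not the paper's metric-fairness bound). For each sensitive
% subgroup g, estimating its error to within \epsilon with probability 1-\delta over a
% hypothesis class of VC dimension d requires roughly
\[
  m_g \;=\; \Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right),
\]
% so as subgroups become intersections of many sensitive variables, the available m_g
% falls below this threshold and purely statistical fairness guarantees break down.
```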