lower value
Fairness-Aware Low-Rank Adaptation Under Demographic Privacy Constraints
Kamalaruban, Parameswaran, Anderson, Mark, Burrell, Stuart, Madigan, Maeve, Skalski, Piotr, Sutton, David
Pre-trained foundation models can be adapted for specific tasks using Low-Rank Adaptation (LoRA). However, the fairness properties of these adapted classifiers remain underexplored. Existing fairness-aware fine-tuning methods rely on direct access to sensitive attributes or their predictors, but in practice, these sensitive attributes are often held under strict consumer privacy controls, and neither the attributes nor their predictors are available to model developers, hampering the development of fair models. To address this issue, we introduce a set of LoRA-based fine-tuning methods that can be trained in a distributed fashion, where model developers and fairness auditors collaborate without sharing sensitive attributes or predictors. In this paper, we evaluate three such methods - sensitive unlearning, adversarial training, and orthogonality loss - against a fairness-unaware baseline, using experiments on the CelebA and UTK-Face datasets with an ImageNet pre-trained ViT-Base model. We find that orthogonality loss consistently reduces bias while maintaining or improving utility, whereas adversarial training improves False Positive Rate Parity and Demographic Parity in some cases, and sensitive unlearning provides no clear benefit. In tasks where significant biases are present, distributed fairness-aware fine-tuning methods can effectively eliminate bias without compromising consumer privacy and, in most cases, improve model utility.
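As a rough illustration of the LoRA mechanism the abstract builds on, the sketch below adds a trainable low-rank update to a frozen weight matrix. It shows generic LoRA only, not the paper's fairness-aware methods; the class name `LoRALinear`, the helper `matvec`, and the toy dimensions are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, never updated
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the pre-trained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x6 frozen weight (assumption: arbitrary values for illustration).
frozen = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(frozen, rank=2)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
initial_output = layer.forward(x)
```

With rank 2, only 20 parameters are trainable here, against 24 in the full matrix; the gap widens quickly for realistic layer sizes.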
$\mathtt{GeLLM^3O}$: Generalizing Large Language Models for Multi-property Molecule Optimization
Dey, Vishal, Hu, Xiao, Ning, Xia
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce $\mathtt{MoMUInstruct}$, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging $\mathtt{MoMUInstruct}$, we develop $\mathtt{GeLLM^3O}$s, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that $\mathtt{GeLLM^3O}$s consistently outperform state-of-the-art baselines. $\mathtt{GeLLM^3O}$s also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of $\mathtt{GeLLM^3O}$s as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. $\mathtt{MoMUInstruct}$, models, and code are accessible through https://github.com/ninglab/GeLLMO.
Investigating the Impact of Balancing, Filtering, and Complexity on Predictive Multiplicity: A Data-Centric Perspective
Cavus, Mustafa, Biecek, Przemyslaw
The Rashomon effect presents a significant challenge in model selection. It occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity. This is especially problematic in high-stakes environments, where arbitrary model outcomes can have serious consequences. Traditional model selection methods prioritize accuracy and fail to address this issue. Factors such as class imbalance and irrelevant variables further complicate the situation, making it harder for models to provide trustworthy predictions. Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques. However, recent studies suggest that preprocessing methods may inadvertently inflate predictive multiplicity. This paper investigates how data preprocessing techniques such as balancing and filtering affect predictive multiplicity and model stability, taking the complexity of the data into account. We conduct experiments on 21 real-world datasets, applying various balancing and filtering techniques, and use the Rashomon effect to assess the level of predictive multiplicity these methods introduce. Additionally, we examine how filtering techniques reduce redundancy and enhance model generalization. The findings provide insights into the relationship between balancing methods, data complexity, and predictive multiplicity, demonstrating how data-centric AI strategies can improve model performance.
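To make predictive multiplicity concrete, the sketch below builds a Rashomon set from a handful of models and computes two commonly used multiplicity measures, ambiguity and discrepancy. The abstract does not say which measures the paper uses, so the choice of these two, the tolerance `epsilon`, and the toy predictions are all assumptions.

```python
def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def rashomon_set(accuracies, epsilon):
    """Names of models whose accuracy is within epsilon of the best model."""
    best = max(accuracies.values())
    return [name for name, acc in accuracies.items() if best - acc <= epsilon]


def ambiguity(reference, competitors):
    """Share of samples whose prediction flips under at least one competing model."""
    n = len(reference)
    flipped = sum(any(p[i] != reference[i] for p in competitors) for i in range(n))
    return flipped / n


def discrepancy(reference, competitors):
    """Largest share of samples on which a single competing model disagrees."""
    n = len(reference)
    return max(sum(a != b for a, b in zip(reference, p)) / n for p in competitors)


# Toy binary predictions from four hypothetical models on eight samples.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "m1": [1, 0, 1, 1, 0, 0, 1, 0],
    "m2": [1, 0, 1, 0, 0, 0, 1, 0],
    "m3": [1, 1, 1, 1, 0, 0, 0, 0],
    "m4": [0, 1, 0, 0, 1, 1, 0, 1],
}
accuracies = {name: accuracy(p, labels) for name, p in predictions.items()}
members = rashomon_set(accuracies, epsilon=0.25)
reference_name = max(members, key=accuracies.get)
competitors = [predictions[m] for m in members if m != reference_name]
amb = ambiguity(predictions[reference_name], competitors)
disc = discrepancy(predictions[reference_name], competitors)
```

In this toy setup, three of the four models fall inside the Rashomon set, yet they disagree on individual samples, which is exactly the situation the abstract describes.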
Evaluating Tokenizer Performance of Large Language Models Across Official Indian Languages
Large Language Models (LLMs) based on transformer architectures have revolutionized a variety of domains, with tokenization playing a pivotal role in their pre-processing and fine-tuning stages. In multilingual models, particularly those tailored for Indic languages, effective tokenization is crucial for optimizing performance. This paper presents a comprehensive evaluation of tokenizers used by 12 LLMs across all 22 official languages of India, with a focus on comparing the efficiency of their tokenization processes. We employed the Normalized Sequence Length (NSL) as a key metric in our analysis. Our findings reveal that the SUTRA tokenizer outperforms all other models, including several Indic-specific models, excelling in 14 languages. Notable insights include the SUTRA tokenizer's superior handling of Indic languages, GPT-4o's advancement over its predecessor GPT-4 in processing Indian languages, and the limited performance of Project Indus in certain languages. This study underscores the critical importance of developing targeted tokenization strategies for multilingual and Indic-centric models, laying the groundwork for future improvements in tokenizer design to enhance linguistic coverage and model efficiency.
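The abstract names Normalized Sequence Length (NSL) but does not define it. The sketch below assumes the common definition: the average, over a set of texts, of the ratio between the token count of the tokenizer under test and that of a baseline tokenizer, so lower values mean more compact tokenization. Both toy tokenizers are hypothetical stand-ins and do not correspond to any tokenizer evaluated in the paper.

```python
from statistics import mean


def normalized_sequence_length(texts, tokenize, baseline_tokenize):
    """Average ratio of token counts: candidate tokenizer vs. baseline tokenizer."""
    ratios = []
    for text in texts:
        baseline_length = len(baseline_tokenize(text))
        if baseline_length == 0:
            continue  # skip texts that the baseline maps to zero tokens
        ratios.append(len(tokenize(text)) / baseline_length)
    if not ratios:
        raise ValueError("no text produced a non-empty baseline tokenization")
    return mean(ratios)


# Hypothetical toy tokenizers, for illustration only.
def whitespace_tokenizer(text):
    return text.split()


def character_tokenizer(text):
    return [ch for ch in text if not ch.isspace()]


samples = ["hello world", "tokenizers matter"]
nsl = normalized_sequence_length(samples, whitespace_tokenizer, character_tokenizer)
```

Comparing a tokenizer against itself gives an NSL of exactly 1, which is a convenient sanity check when wiring in real tokenizers.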
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
Kiyohara, Haruka, Kishimoto, Ren, Kawakami, Kosuke, Kobayashi, Ken, Nakata, Kazuhide, Saito, Yuta
Off-Policy Evaluation (OPE) aims to assess the effectiveness of counterfactual policies using only offline logged data and is often used to identify the top-k promising policies for deployment in online A/B tests. Existing evaluation metrics for OPE estimators primarily focus on the "accuracy" of OPE or that of downstream policy selection, neglecting the risk-return tradeoff in the subsequent online policy deployment. To address this issue, we draw inspiration from portfolio evaluation in finance and develop a new metric, called SharpeRatio@k, which measures the risk-return tradeoff of policy portfolios formed by an OPE estimator under varying online evaluation budgets (k). We validate our metric in two example scenarios, demonstrating its ability to effectively distinguish between low-risk and high-risk estimators and to accurately identify the most efficient estimator. This efficient estimator is characterized by its capability to form the most advantageous policy portfolios, maximizing returns while minimizing risks during online deployment, a nuance that existing metrics typically overlook. These experiments offer several interesting directions and suggestions for future OPE research.
Reinforcement Learning (RL) has achieved considerable success in a variety of applications requiring sequential decision-making. Nonetheless, its online learning approach is often seen as problematic due to the need for active interaction with the environment, which can be risky, time-consuming, and unethical (Fu et al., 2021; Matsushima et al., 2021). To mitigate these issues, learning new policies offline from existing historical data, known as Offline RL (Levine et al., 2020), is becoming increasingly popular for real-world applications (Qin et al., 2021).
Typically, in the offline RL lifecycle, promising candidate policies are initially screened through Off-Policy Evaluation (OPE) (Fu et al., 2020), followed by the selection of the final production policy from the shortlisted candidates using more dependable online A/B tests (Kurenkov & Kolesnikov, 2022), as shown in Figure 1. When evaluating the efficacy of OPE methods, research has largely concentrated on "accuracy" metrics like mean-squared error (MSE) (Uehara et al., 2022; Voloshin et al., 2019), rank correlation (rankcorr) (Fu et al., 2021; Paine et al., 2020), and regret in subsequent policy selection (Doroudi et al., 2017; Tang & Wiens, 2021). However, these existing metrics do not adequately assess the balance between risk and return experienced by an estimator during the online deployment of selected policies. Crucially, MSE and rankcorr fall short in distinguishing whether an estimator is underevaluating near-optimal policies or overevaluating poor-performing ones, which influence the risk-return dynamics in OPE and policy selection in different ways.
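The text above describes SharpeRatio@k only in words. The sketch below assumes the form (best true value among the top-k policies minus the behavior policy's value) divided by the standard deviation of the true values of those top-k policies. The use of the population standard deviation, the function name, and all numbers are assumptions for illustration.

```python
from statistics import pstdev


def sharpe_ratio_at_k(estimated_values, true_values, behavior_value, k):
    """Risk-return score of the top-k policy portfolio chosen by an OPE estimator."""
    if k < 2:
        raise ValueError("k must be at least 2 so the portfolio has a spread")
    # Rank candidate policies by their OPE estimate, best first.
    ranked = sorted(range(len(estimated_values)),
                    key=lambda i: estimated_values[i], reverse=True)
    portfolio = [true_values[i] for i in ranked[:k]]
    risk = pstdev(portfolio)          # spread of true values in the portfolio
    if risk == 0:
        raise ValueError("portfolio has zero spread; the ratio is undefined")
    excess_return = max(portfolio) - behavior_value
    return excess_return / risk


# Five hypothetical candidate policies and a behavior policy with value 1.5.
true_values = [1.0, 2.0, 3.0, 0.5, 2.5]
behavior_value = 1.5
cautious_estimates = [1.1, 2.1, 2.9, 0.6, 2.4]   # ranks policies almost correctly
reckless_estimates = [3.0, 1.0, 2.9, 2.8, 0.5]   # overvalues two poor policies

cautious_score = sharpe_ratio_at_k(cautious_estimates, true_values, behavior_value, k=3)
reckless_score = sharpe_ratio_at_k(reckless_estimates, true_values, behavior_value, k=3)
```

Both toy estimators place the best policy in their top 3, so a metric that looks only at the best selected policy would not separate them. The second estimator also deploys two poor policies, which inflates the denominator and lowers its score.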
Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability
Zhu, Jianing, Li, Hengzhuang, Yao, Jiangchao, Liu, Tongliang, Xu, Jianliang, Han, Bo
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications. Previous paradigms either explore better scoring functions or utilize the knowledge of outliers to equip the models with the ability of OOD detection. However, few of them pay attention to the intrinsic OOD detection capability of the given model. In this work, we find that, across different settings, a model trained on in-distribution (ID) data has an intermediate stage with higher OOD detection performance than its final stage, and we further identify learning with atypical samples as one critical data-level cause. Based on these insights, we propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data. Our method uses a mask to identify the memorized atypical samples, and then fine-tunes the model, or prunes it with the introduced mask, to forget them. Extensive experiments and analysis demonstrate the effectiveness of our method. The code is available at: https://github.com/tmlr-group/Unleashing-Mask.
Comparing Two Samples Through Stochastic Dominance: A Graphical Approach
Arza, Etor, Ceberio, Josu, Irurozki, Ekhiñe, Pérez, Aritz
Non-deterministic measurements are common in real-world scenarios: the performance of a stochastic optimization algorithm or the total reward of a reinforcement learning agent in a chaotic environment are just two examples in which unpredictable outcomes are common. These measures can be modeled as random variables and compared with each other via their expected values or more sophisticated tools such as null hypothesis statistical tests. In this paper, we propose an alternative framework to visually compare two samples according to their estimated cumulative distribution functions. First, we introduce a dominance measure for two random variables that quantifies the proportion in which the cumulative distribution function of one of the random variables stochastically dominates the other one. Then, we present a graphical method that decomposes by quantile i) the proposed dominance measure and ii) the probability that one of the random variables takes lower values than the other. For illustrative purposes, we re-evaluate the experiments of a previously published work with the proposed methodology and show that additional conclusions (missed by the other methods) can be inferred. Additionally, the software package RVCompare was created as a convenient way of applying and experimenting with the proposed framework.
Direct loss minimization for sparse Gaussian processes
Wei, Yadi, Sheth, Rishit, Khardon, Roni
The Gaussian process (GP) is an attractive Bayesian model for machine learning which combines an elegant formulation with model flexibility and uncertainty quantification. Sparse Gaussian process (sGP) algorithms provide an approximate solution that mitigates the high computational complexity of GP, and the variational approximation is the current best practice for such approximations. Recent theoretical work has shown that an alternative approach, direct loss minimization (DLM), which directly minimizes predictive loss, comes with strong guarantees on the expected loss of the algorithm. In this paper, we explore this approach experimentally. We develop the DLM algorithm for sGP and show that, with appropriate hyperparameter optimization, it provides a significant improvement over the variational approach. In particular, optimizing sGP for log loss provides better calibrated predictions for regression, classification and count prediction, and optimizing sGP for square loss improves the mean square error in regression.
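The difference between the two objectives can be seen on a single data point. The sketch below assumes Gaussian regression, with a Gaussian marginal posterior over the latent value and Gaussian observation noise. It compares the per-point term that appears in a variational objective (the expected log-likelihood) with the per-point term targeted by DLM for log loss (the log of the expected likelihood, which is the log predictive density). Function names and numbers are illustrative assumptions, and regularization terms are omitted.

```python
import math


def expected_log_likelihood(y, mean, var, noise_var):
    """E_q[log N(y | f, noise_var)] for q(f) = N(mean, var): the variational data term."""
    return (-0.5 * math.log(2 * math.pi * noise_var)
            - ((y - mean) ** 2 + var) / (2 * noise_var))


def log_expected_likelihood(y, mean, var, noise_var):
    """log E_q[N(y | f, noise_var)] = log N(y | mean, var + noise_var): the log-loss data term."""
    total_var = var + noise_var
    return (-0.5 * math.log(2 * math.pi * total_var)
            - (y - mean) ** 2 / (2 * total_var))


# One hypothetical observation and posterior marginal.
y, mean, var, noise_var = 1.0, 0.5, 0.2, 0.1
variational_term = expected_log_likelihood(y, mean, var, noise_var)
dlm_term = log_expected_likelihood(y, mean, var, noise_var)
```

By Jensen's inequality the second term is never smaller than the first, and the two coincide when the posterior variance is zero. The DLM term is the predictive log density actually used at test time, which is the quantity the log-loss criterion evaluates.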
Detecting Out-of-Distribution Inputs in Deep Neural Networks Using an Early-Layer Output
Abdelzad, Vahdat, Czarnecki, Krzysztof, Salay, Rick, Denounden, Taylor, Vernekar, Sachin, Phan, Buu
Deep neural networks achieve superior performance in challenging tasks such as image classification. However, deep classifiers tend to incorrectly classify out-of-distribution (OOD) inputs, which are inputs that do not belong to the classifier's training distribution. Several approaches have been proposed to detect OOD inputs, but the detection task is still an ongoing challenge. In this paper, we propose a new OOD detection approach that can be easily applied to an existing classifier and does not need access to OOD samples. The detector is a one-class classifier trained on the output of an early layer of the original classifier fed with its original training set. We apply our approach to several low- and high-dimensional datasets and compare it to state-of-the-art detection approaches. Our approach achieves substantially better results across multiple metrics.
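The pipeline described above (early-layer features of the training set, then a one-class model fitted on them) can be sketched in a few lines. The abstract does not say which one-class classifier is used, so the sketch substitutes a very simple one: distance to the feature centroid with a threshold taken from the training distances. The `early_layer` function is a hypothetical stand-in for a real network's early activations, and all data are synthetic.

```python
import math
import random


def early_layer(x):
    """Hypothetical stand-in for the early-layer activations of a trained classifier."""
    return [max(0.0, x[0] + x[1]), max(0.0, x[0] - x[1])]


class CentroidOneClassDetector:
    """Simple one-class model: flags features that lie far from the training centroid."""

    def fit(self, features, quantile=0.95):
        dim = len(features[0])
        self.center = [sum(f[d] for f in features) / len(features) for d in range(dim)]
        distances = sorted(math.dist(f, self.center) for f in features)
        # Threshold chosen so roughly `quantile` of training features are accepted.
        index = min(len(distances) - 1, int(quantile * len(distances)))
        self.threshold = distances[index]
        return self

    def is_ood(self, feature):
        return math.dist(feature, self.center) > self.threshold


# Synthetic in-distribution training inputs clustered around (1, 0).
rng = random.Random(0)
training_inputs = [[rng.gauss(1.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(200)]
training_features = [early_layer(x) for x in training_inputs]
detector = CentroidOneClassDetector().fit(training_features)

typical_flag = detector.is_ood(early_layer([1.0, 0.0]))   # input near the training data
unusual_flag = detector.is_ood(early_layer([5.0, 5.0]))   # input far from the training data
```

No OOD samples are needed to fit the detector, which mirrors the property the abstract emphasizes. A stronger one-class model could replace the centroid rule without changing the rest of the pipeline.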