Goto

Collaborating Authors

 absolute bias


UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge

arXiv.org Artificial Intelligence

Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. This bias leads to inconsistent and skewed rankings across different judges. To address this, we first empirically demonstrate significant and heterogeneous biases in cross-model evaluations. We then propose UDA (Unsupervised Debiasing Alignment), a framework that reduces inter-judge disagreement by dynamically adjusting the Elo rating system. For each pairwise comparison, a compact neural network learns to adaptively set the K-factor and refine win probabilities. Crucially, UDA operates in a fully unsupervised manner, guided solely by the objective of minimizing the dispersion among the Elo trajectories of all judges. This forces an alignment towards a collective consensus, which serves as an unsupervised proxy for a more stable and reproducible evaluation. In addition, we provide theoretical motivation demonstrating how alignment towards a consensus can reduce aggregate system bias. Experiments show that UDA significantly reduces the inter-judge rating standard deviation by up to 63.4% and improves the average correlation with human judgments by 24.7%. Notably, UDA elevates the performance of poorly performing judges to achieve parity with high-quality ones, fostering a more robust and reliable evaluation ecosystem. Code and data are available at https://anonymous.4open.science/r/62AB93CD-23B4.


BiasAsker: Measuring the Bias in Conversational AI System

arXiv.org Artificial Intelligence

Powered by advanced Artificial Intelligence (AI) techniques, conversational AI systems, such as ChatGPT and digital assistants like Siri, have been widely deployed in daily life. However, such systems may still produce content containing biases and stereotypes, causing potential social problems. Due to the data-driven, black-box nature of modern AI techniques, comprehensively identifying and measuring biases in conversational systems remains a challenging task. Particularly, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups as well as biased properties. In addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods simply based on the sentiment and the toxicity hardly being adopted. In this paper, we propose BiasAsker, an automated framework to identify and measure social bias in conversational AI systems. To obtain social groups and biased properties, we construct a comprehensive social bias dataset, containing a total of 841 groups and 8,110 biased properties. Given the dataset, BiasAsker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. Extensive experiments on 8 commercial systems and 2 famous research models, such as ChatGPT and GPT-3, show that 32.83% of the questions generated by BiasAsker can trigger biased behaviors in these widely deployed conversational systems. All the code, data, and experimental results have been released to facilitate future research.


Meta-Learners for Estimation of Causal Effects: Finite Sample Cross-Fit Performance

arXiv.org Machine Learning

In recent years there has been a growing interest in the estimation of causal effects using machine learning algorithms, particularly in the field of economics (Athey, 2018). The newly emerging synthesis of machine learning methods with causal inference has a large potential for a more comprehensive estimation of causal effects (Lechner, 2018). On the one hand, it enables a more flexible estimation of average effects which are of main interest in microeconometrics (Imbens & Wooldridge, 2009). On the other hand, it advances the estimation beyond the average effects and allows for a systematic analysis of effect heterogeneity (Athey & Imbens, 2017). Both of these aspects contribute to a better description of the causal mechanisms and thus to a possibly more efficient treatment allocation (Zhao, Zeng, Rush, & Kosorok, 2012; Kitagawa & Tetenov, 2018; Athey & Wager, 2021; Nie, Brunskill, & Wager, 2021). Hence, applied empirical researchers can greatly benefit from the usage of machine learning methods ranging from evaluation of public policies and business decisions to designing personalized interventions (Andini, Ciani, de Blasio, D'Ignazio, & Salvestrini, 2018; Bansak et al., 2018). Machine learning estimators as such are, however, primarily designed for prediction problems and thus cannot be used directly for causal inference. Therefore, new approaches for the estimation of causal parameters using machine learning emerged (see Athey & Imbens, 2019, for an overview). In particular, the development of the so-called meta-learners have received considerable attention (see e.g.


Decision Making with Differential Privacy under a Fairness Lens

arXiv.org Artificial Intelligence

Agencies, such as the U.S. Census Bureau, release data sets and statistics about groups of individuals that are used as input to a number of critical decision processes. To conform with privacy and confidentiality requirements, these agencies are often required to release privacy-preserving versions of the data. This paper studies the release of differentially private data sets and analyzes their impact on some critical resource allocation tasks under a fairness perspective. The paper shows that, when the decisions take as input differentially private data, the noise added to achieve privacy disproportionately impacts some groups over others. The paper analyzes the reasons for these disproportionate impacts and proposes guidelines to mitigate these effects. The proposed approaches are evaluated on critical decision problems that use differentially private census data.


Missing Mass of Rank-2 Markov Chains

arXiv.org Machine Learning

Estimation of missing mass with the popular Good-Turing (GT) estimator is well-understood in the case where samples are independent and identically distributed (iid). In this article, we consider the same problem when the samples come from a stationary Markov chain with a rank-2 transition matrix, which is one of the simplest extensions of the iid case. We develop an upper bound on the absolute bias of the GT estimator in terms of the spectral gap of the chain and a tail bound on the occupancy of states. Borrowing tail bounds from known concentration results for Markov chains, we evaluate the bound using other parameters of the chain. The analysis, supported by simulations, suggests that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared error falling with number of samples at a rate that depends loosely on the connectivity of the states in the chain.


Statistical Estimation of Malware Detection Metrics in the Absence of Ground Truth

arXiv.org Machine Learning

The accurate measurement of security metrics is a critical research problem because an improper or inaccurate measurement process can ruin the usefulness of the metrics, no matter how well they are defined. This is a highly challenging problem particularly when the ground truth is unknown or noisy. In contrast to the well perceived importance of defining security metrics, the measurement of security metrics has been little understood in the literature. In this paper, we measure five malware detection metrics in the {\em absence} of ground truth, which is a realistic setting that imposes many technical challenges. The ultimate goal is to develop principled, automated methods for measuring these metrics at the maximum accuracy possible. The problem naturally calls for investigations into statistical estimators by casting the measurement problem as a {\em statistical estimation} problem. We propose statistical estimators for these five malware detection metrics. By investigating the statistical properties of these estimators, we are able to characterize when the estimators are accurate, and what adjustments can be made to improve them under what circumstances. We use synthetic data with known ground truth to validate these statistical estimators. Then, we employ these estimators to measure five metrics with respect to a large dataset collected from VirusTotal. We believe our study touches upon a vital problem that has not been paid due attention and will inspire many future investigations.