Collaborating Authors

 Binkyte, Ruta


On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation

arXiv.org Artificial Intelligence

Accurately measuring discrimination is crucial to faithfully assessing the fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist, and it is typically assumed that bias resulting from machine learning is borne equally by different groups (e.g., females vs. males, whites vs. blacks). If, however, bias is borne differently by different groups, it may exacerbate discrimination against specific sub-populations. The term sampling bias, in particular, is used inconsistently in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). Through an extensive set of experiments on benchmark datasets and using mainstream learning algorithms, we report relevant observations across several model training scenarios. These observations are finally framed as actionable recommendations for practitioners.
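
To make the distinction concrete, the following minimal sketch (hypothetical, not the paper's experimental protocol) contrasts the two variants on synthetic data: sample size bias shrinks the overall training sample while keeping group proportions fixed, whereas underrepresentation bias keeps the sample size fixed but shrinks one group's share. The data generator, classifier, and fairness metric (statistical parity gap) are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's protocol): contrast sample size bias
# (SSB) -- a uniformly smaller training sample -- with underrepresentation
# bias (URB) -- a fixed-size sample in which one group's share shrinks.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_population(n):
    """Synthetic population with a binary sensitive attribute s."""
    s = rng.integers(0, 2, size=n)                     # group membership
    x = rng.normal(loc=0.5 * s, scale=1.0, size=n)     # one feature, shifted by group
    y = (x + rng.normal(scale=1.0, size=n) > 0.5).astype(int)
    return np.column_stack([x, s]), y, s

def parity_gap(model, X, s):
    """|P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)| on held-out data."""
    y_hat = model.predict(X)
    return abs(y_hat[s == 0].mean() - y_hat[s == 1].mean())

X_test, y_test, s_test = make_population(20_000)

def gap_under_bias(n_train, group1_share):
    """Draw a biased training sample, fit a classifier, return its parity gap."""
    X, y, s = make_population(500_000)
    idx0 = rng.choice(np.flatnonzero(s == 0), int(n_train * (1 - group1_share)), replace=False)
    idx1 = rng.choice(np.flatnonzero(s == 1), int(n_train * group1_share), replace=False)
    idx = np.concatenate([idx0, idx1])
    model = LogisticRegression().fit(X[idx], y[idx])
    return parity_gap(model, X_test, s_test)

# SSB: groups stay balanced, the overall training sample shrinks.
for n in (10_000, 1_000, 100):
    print(f"SSB  n={n:>6}             gap={gap_under_bias(n, 0.5):.3f}")

# URB: the training sample size is fixed, group 1's share shrinks.
for share in (0.5, 0.1, 0.01):
    print(f"URB  share(group 1)={share:.2f}   gap={gap_under_bias(10_000, share):.3f}")
```

The two sweeps separate the effect of having less data overall from the effect of having less data for one specific group, which is the distinction the paper's SSB/URB terminology makes precise.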


Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

arXiv.org Artificial Intelligence

Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.


Safety is Essential for Responsible Open-Ended Systems

arXiv.org Artificial Intelligence

AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capability and adaptability. A growing area of interest within this field is Open-Endedness - the ability of AI systems to continuously and autonomously generate novel and diverse artifacts or solutions. This has become relevant for accelerating scientific discovery and enabling continual adaptation in AI agents. This position paper argues that the inherently dynamic and self-propagating nature of Open-Ended AI introduces significant, underexplored risks, including challenges in maintaining alignment, predictability, and control. It systematically examines these challenges, proposes mitigation strategies, and calls on different stakeholders to support the safe, responsible, and successful development of Open-Ended AI.


LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation

arXiv.org Artificial Intelligence

Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.

Single-cell RNA sequencing (scRNA-seq) is a cutting-edge technology that enables the collection of gene expression data from individual cells. This approach opens up new avenues for a wide range of scientific and clinical applications. One crucial application of scRNA-seq data is the reconstruction and analysis of gene regulatory networks (GRNs), which represent the interactions between genes. GRN analysis can deepen our understanding of disease mechanisms, identify key regulatory pathways, and provide a foundation for the development of interventional gene therapies and targeted drug discovery. Statistical causal discovery algorithms (Scheines et al., 1998; Zheng et al., 2018; Mercatelli et al., 2020; Brouillard et al., 2020; Lippe et al., 2021; Yu & Welch, 2022; Roohani et al., 2024) can reveal potential causal links between TFs and their target genes. However, they often lack robustness and are prone to detecting spurious correlations, especially in high-dimensional, noisy single-cell data. Furthermore, many of these approaches rely heavily on prior knowledge from curated databases (e.g., TRANSFAC (Wingender et al., 1996), RegNetwork (Liu et al., 2015), ENCODE (de Souza, 2012), BioGRID (de Souza, 2012), and AnimalTFDB (Hu et al., 2019)), which frequently lack essential contextual information such as specific cell types or conditions, leading to inaccuracies in the inferred regulatory relationships (Zinati et al., 2024). Most of the above methods involve refining the statistically inferred causal graph with an LLM.
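
As a rough illustration of the task-based evaluation idea (a hypothetical sketch, not the paper's pipeline), one can treat a candidate GRN as a directed TF-to-target graph, sample synthetic expression data from a simple linear-Gaussian structural equation model over that graph, and score how closely the synthetic data matches the observed data. The toy genes, edge weights, and Wasserstein-based score below are illustrative assumptions.

```python
# Hypothetical sketch of the task-based evaluation idea: sample synthetic
# expression data from a candidate GRN (as a linear-Gaussian SEM) and
# compare it with the observed data.  Genes, weights, and the distance
# metric are toy assumptions, not the paper's actual pipeline.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

genes = ["TF1", "TF2", "GeneA", "GeneB"]
gene_index = {g: i for i, g in enumerate(genes)}

def sample_from_grn(edges, n_cells=1000, weight=0.8):
    """Sample expression values from a linear-Gaussian SEM on a TF -> target graph."""
    X = np.zeros((n_cells, len(genes)))
    targets = {t for _, t in edges}
    # Two-level graph: simulate regulators (non-targets) before their targets.
    order = [g for g in genes if g not in targets] + [g for g in genes if g in targets]
    for g in order:
        parents = [s for s, t in edges if t == g]
        X[:, gene_index[g]] = rng.normal(size=n_cells) + weight * sum(
            (X[:, gene_index[p]] for p in parents), np.zeros(n_cells))
    return X

def mean_gene_distance(real, synth):
    """Mean per-gene Wasserstein distance between real and synthetic expression."""
    return np.mean([wasserstein_distance(real[:, j], synth[:, j])
                    for j in range(real.shape[1])])

# Stand-in for observed scRNA-seq data, generated here from a "true" GRN.
true_edges = [("TF1", "GeneA"), ("TF1", "GeneB"), ("TF2", "GeneB")]
observed = sample_from_grn(true_edges)

# Score two candidate GRNs (e.g., suggested by an LLM): lower distance = better fit.
for name, candidate in [("full", true_edges), ("partial", [("TF1", "GeneA")])]:
    synth = sample_from_grn(candidate)
    print(f"candidate GRN ({name}): distance = {mean_gene_distance(observed, synth):.3f}")
```

In practice the comparison would use real scRNA-seq data and richer statistical and biological criteria than a single per-gene distance.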


BaBE: Enhancing Fairness via Estimation of Latent Explaining Variables

arXiv.org Artificial Intelligence

We consider the problem of unfair discrimination between two groups and propose a pre-processing method to achieve fairness. Corrective methods like statistical parity usually lead to poor accuracy and do not truly achieve fairness in situations where there is a correlation between the sensitive attribute S and the legitimate attribute E (explanatory variable) that should determine the decision. To overcome these drawbacks, other notions of fairness have been proposed, in particular, conditional statistical parity and equal opportunity. However, E is often not directly observable in the data, i.e., it is a latent variable. We may observe some other variable Z representing E, but the problem is that Z may also be affected by S, hence Z itself can be biased. To deal with this problem, we propose BaBE (Bayesian Bias Elimination), an approach based on a combination of Bayes inference and the Expectation-Maximization method, to estimate the most likely value of E for a given Z for each group. The decision can then be based directly on the estimated E. We show, by experiments on synthetic and real data sets, that our approach provides a good level of fairness as well as high accuracy.
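
A minimal sketch of the core estimation step, under simplifying assumptions: E and Z are discrete, one EM run is performed per group, and a diagonally dominant initialization anchors E's categories to Z's. The data generator and hyperparameters below are illustrative, not the paper's specification.

```python
# Sketch of per-group EM for a latent explanatory variable E observed only
# through a (possibly biased) proxy Z.  The model, initialization, and toy
# data are assumptions for illustration, not BaBE's exact specification.
import numpy as np

rng = np.random.default_rng(0)

def babe_em(z, n_e, n_z, n_iter=200):
    """EM for a per-group discrete model: prior P(E) and emissions P(Z | E)."""
    prior = np.full(n_e, 1.0 / n_e)
    # Diagonally dominant initialization encodes the assumption that Z is a
    # noisy proxy of E with matching categories (it anchors E's labels to Z's).
    emission = np.full((n_e, n_z), 0.1)
    np.fill_diagonal(emission, 1.0)
    emission /= emission.sum(axis=1, keepdims=True)
    one_hot = np.eye(n_z)[z]                                # shape (n, n_z)
    for _ in range(n_iter):
        # E-step: responsibilities P(E = e | Z = z_i) via Bayes' rule.
        joint = prior[None, :] * emission[:, z].T           # shape (n, n_e)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate the prior and the emission probabilities.
        prior = resp.mean(axis=0)
        emission = resp.T @ one_hot
        emission /= emission.sum(axis=1, keepdims=True)
    return prior, emission

# Toy data: within each group, the proxy Z is a noisy version of the latent E,
# and the noise level differs across groups (group 1's Z is more distorted).
n, n_e, n_z = 20_000, 3, 3
p_e = np.array([0.6, 0.3, 0.1])                             # true P(E), same for both groups
for group, flip in [(0, 0.1), (1, 0.4)]:
    e_true = rng.choice(n_e, size=n, p=p_e)
    z = np.where(rng.random(n) < flip, rng.integers(0, n_z, size=n), e_true)
    prior, _ = babe_em(z, n_e, n_z)
    print(f"group {group}: empirical P(Z) = {(np.bincount(z, minlength=n_z) / n).round(3)}")
    print(f"group {group}: estimated P(E) = {prior.round(3)}   true P(E) = {p_e}")
```

The fitted per-group posterior P(E | Z, S) is what would then drive the downstream decision, so the decision depends on the estimated E rather than on the biased proxy Z.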