Statistical Learning

Mitigating biases in machine learning


Machine Learning (ML) is increasingly being used to simplify and automate a number of important computational tasks in modern society. From the disbursement of bank loans to job application screenings, these computer systems streamline several processes that have a considerable impact on our day-to-day lives. However, these artificially intelligent systems are most often devised to emulate human decision making, an inherently biased framework. For example, Microsoft's Tay online chatbot quickly learned to tweet using racial slurs as a result of the biased online input stream (Caton and Haas 2020), and the COMPAS tool often flagged black individuals as more likely to commit a crime (even if two individuals were statistically similar with respect to many other attributes) (Flores, Bechtel and Lowenkamp 2016). Crucially, these issues are not the product of a malevolent computer programmer instilling radical beliefs, but rather a byproduct of machines learning to optimize for a particular objective, which can inadvertently leverage underlying biases present in the data.

Fairness in Forecasting of Observations of Linear Dynamical Systems

Journal of Artificial Intelligence Research

In machine learning, training data often capture the behaviour of multiple subgroups of some underlying human population. This behaviour can often be modelled as observations of an unknown dynamical system with an unobserved state. When the training data for the subgroups are not controlled carefully, however, under-representation bias arises. To counter under-representation bias, we introduce two natural notions of fairness in time-series forecasting problems: subgroup fairness and instantaneous fairness. These notions extend predictive parity to the learning of dynamical systems. We present globally convergent methods for the fairness-constrained learning problems using hierarchies of convexifications of non-commutative polynomial optimisation problems. We further show that, by exploiting sparsity in the convexifications, we can reduce the run time of our methods considerably. Our empirical results on a biased data set motivated by insurance applications and the well-known COMPAS data set demonstrate the efficacy of our methods.

Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach

Journal of Artificial Intelligence Research

Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. In more detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measuring demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
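To make the link between group-level prevalence estimates and demographic parity concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the paper: the function names and the toy data are hypothetical, and the prevalences are computed from known attributes purely to check the arithmetic. In the unawareness setting they would instead come from a quantifier. The sketch only shows that, by Bayes' rule, the prevalence of a sensitive group among accepted and rejected individuals, together with the overall acceptance rate, is enough to recover the demographic parity gap.

```python
def prevalence(values):
    """Fraction of positive (1/True) entries in a sequence."""
    values = list(values)
    return sum(values) / len(values)


def demographic_parity_gap(y_pred, s):
    """Direct computation, possible only when the sensitive attribute s is observed."""
    rate_s1 = prevalence(p for p, g in zip(y_pred, s) if g)
    rate_s0 = prevalence(p for p, g in zip(y_pred, s) if not g)
    return rate_s1 - rate_s0


def gap_from_group_prevalences(p_s_given_pos, p_s_given_neg, accept_rate):
    """Recover P(y_hat=1 | S=1) - P(y_hat=1 | S=0) from group-level quantities.

    p_s_given_pos: prevalence of S=1 among positively classified individuals
    p_s_given_neg: prevalence of S=1 among negatively classified individuals
    accept_rate:   overall fraction of positive predictions
    """
    joint_s1_pos = p_s_given_pos * accept_rate
    joint_s1_neg = p_s_given_neg * (1 - accept_rate)
    joint_s0_pos = (1 - p_s_given_pos) * accept_rate
    joint_s0_neg = (1 - p_s_given_neg) * (1 - accept_rate)
    rate_s1 = joint_s1_pos / (joint_s1_pos + joint_s1_neg)
    rate_s0 = joint_s0_pos / (joint_s0_pos + joint_s0_neg)
    return rate_s1 - rate_s0


# Toy data (made up for illustration): classifier decisions and group membership.
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
s = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

direct_gap = demographic_parity_gap(y_pred, s)

# Stand-ins for quantifier output: here computed from the true attributes.
p_pos = prevalence(g for p, g in zip(y_pred, s) if p)
p_neg = prevalence(g for p, g in zip(y_pred, s) if not p)
indirect_gap = gap_from_group_prevalences(p_pos, p_neg, prevalence(y_pred))
```

Because the second route needs only two prevalence estimates and the acceptance rate, no individual-level sensitive attribute has to be inferred or stored.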



Over the past few decades, advances in computational resources and computer science, combined with next-generation sequencing and other emerging omics techniques, ushered in a new era of biology, allowing for sophisticated analysis of complex biological data. Bioinformatics is evolving as an integrative field between computer science and biology that allows the representation, storage, management, analysis and investigation of numerous data types with diverse algorithms and computational tools. Bioinformatics approaches include sequence analysis, comparative genomics, molecular evolution studies and phylogenetics, protein and RNA structure prediction, gene expression and regulation analysis, and biological network analysis, as well as the genetics of human diseases (in particular, cancer) and medical image analysis [1,2,3]. Machine learning (ML) is a field in computer science that studies the use of computers to simulate human learning by exploring patterns in the data and applying self-improvement to continually enhance the performance of learning tasks. ML algorithms can be roughly divided into supervised learning algorithms, which learn to map input examples to their respective outputs, and unsupervised learning algorithms, which identify hidden patterns in unlabeled data. The advances made in machine learning over the past decade transformed the landscape of data analysis [4,5,6].

Decentralized Gradient-Quantization Based Matrix Factorization for Fast Privacy-Preserving Point-of-Interest Recommendation

Journal of Artificial Intelligence Research

With the rapid growth of location-based social networks, point-of-interest (POI) recommendation has been attracting tremendous attention. Previous works on POI recommendation usually use matrix factorization (MF)-based methods, which achieve promising performance. However, existing MF-based methods suffer from two critical limitations: (1) Privacy issues: all users’ sensitive data are collected on a centralized server, where they may leak either on the server side or during transmission. (2) Poor resource utilization and training efficiency: training on a centralized server with potentially huge low-rank matrices is computationally inefficient. In this paper, we propose a novel decentralized gradient-quantization based matrix factorization (DGMF) framework to address the above limitations in POI recommendation. Compared with the centralized MF methods, which store all sensitive data and low-rank matrices during model training, DGMF treats each user’s device (e.g., phone) as an independent learner and keeps the sensitive data on each user’s end. Furthermore, a privacy-preserving and communication-efficient mechanism with a gradient-quantization technique is presented to train the proposed model, which aims to handle the privacy problem and reduce the communication cost in the decentralized setting. Theoretical guarantees of the proposed algorithm and experimental studies on real-world datasets demonstrate the effectiveness of the proposed algorithm.

TERI School of Advanced Studies - Masters and Ph.D in Delhi


Master of Science in Geoinformatics at TERI SAS is a two-year interdisciplinary program for students who want to develop expertise in geospatial technologies and apply them to solve the world's most pressing challenges in environmental, social and economic domains. Geoinformatics is a rapidly evolving field that brings meaningful insights to real-world problems by bringing together the technologies and tools required for acquisition, exploration, visualization, analysis and integration of various spatial data. The components of Geoinformatics include cartographic geovisualization, GIS, remote sensing, photogrammetry, spatial statistics, geostatistics, multivariate statistics and other advanced tools and techniques. The core strength of the programme lies in its innovative curriculum, which trains present and future professionals in the development and use of cutting-edge geospatial technologies to model real-life problems. Over the period of two years, students gain sound knowledge of the scientific principles behind the computational and analytical foundations of Geoinformatics as well as its applications in domains such as conservation biology, urban planning, meteorology and natural resource management through hands-on exercises, training programmes, an 8-week summer internship, independent study and a semester-long major project.

Exploring Unsupervised Learning Metrics - KDnuggets


Unsupervised learning is a branch of machine learning where the models learn patterns from the available data rather than being provided with the actual labels. We let the algorithm come up with the answers. In unsupervised learning, there are two main techniques: clustering and dimensionality reduction. The clustering technique uses an algorithm to learn patterns that segment the data. In contrast, the dimensionality reduction technique tries to reduce the number of features while keeping the actual information intact as much as possible.
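As a small illustration of how a clustering algorithm segments data without labels, the Python sketch below runs Lloyd's k-means procedure on one-dimensional points. The function name, the toy points and the initial centers are all made up for this example. It is a sketch of one common clustering method, not a reference implementation.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied initial centers."""
    centers = list(centers)

    def nearest(x):
        # Index of the center closest to x.
        return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for x in points:
            clusters[nearest(x)].append(x)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]

    labels = [nearest(x) for x in points]
    return centers, labels


# Two visibly separated groups of points; no labels are supplied.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, labels = kmeans_1d(points, centers=[0.0, 6.0])
```

The algorithm receives only the points and a number of clusters (implied by the initial centers). The group assignments in `labels` are its own answer, which is the sense in which the model learns the segmentation from the data.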

Principal Components Regression in R (Step-by-Step)


However, when the predictor variables are highly correlated then multicollinearity can become a problem. This can cause the coefficient estimates of the model to be unreliable and have high variance. One way to avoid this problem is to instead use principal components regression, which finds M linear combinations (known as "principal components") of the original p predictors and then uses least squares to fit a linear regression model using the principal components as predictors. This tutorial provides a step-by-step example of how to perform principal components regression in R. The easiest way to perform principal components regression in R is by using functions from the pls package. For this example, we'll use the built-in R dataset called mtcars which contains data about various types of cars: For this example we'll fit a principal components regression (PCR) model using hp as the response variable and the following variables as the predictor variables: The following code shows how to fit the PCR model to this data.
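The procedure described above (form linear combinations of the predictors, then regress the response on them by least squares) can be sketched in a language-neutral way. The Python example below is a minimal illustration with M = 1 component, using only the standard library. It does not use the pls package or the mtcars data: the helper names and the toy numbers are assumptions made for this sketch, and the leading principal direction is found by power iteration as a simple stand-in for a full eigendecomposition.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def first_principal_direction(rows, iterations=200):
    """Power iteration on X^T X to approximate the leading principal direction."""
    p = len(rows[0])
    # Gram matrix X^T X; for centered data it is proportional to the covariance.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    w = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(gram[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def pcr_one_component(columns, y):
    """PCR with M = 1: regress y on the first principal component score."""
    scaled = [standardize(c) for c in columns]
    rows = list(zip(*scaled))
    w = first_principal_direction(rows)
    scores = [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    y_mean = sum(y) / len(y)
    # The scores are centered, so the least-squares slope has a closed form.
    slope = sum(s * (yi - y_mean) for s, yi in zip(scores, y)) / sum(
        s * s for s in scores
    )
    return w, scores, y_mean, slope


# Toy data (made up): two highly correlated predictors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

w, scores, y_mean, slope = pcr_one_component([x1, x2], y)
fitted = [y_mean + slope * s for s in scores]
```

Because the two predictors carry almost the same information, a single component that weights them about equally captures most of their variance, and the regression on that one score avoids the unstable coefficient estimates that multicollinearity would cause in an ordinary least-squares fit on both predictors.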

Unveiling Machine Learning: Unlock the True Potential of AI Technologies - Devops7


As the core of AI technologies, machine learning has become a significant force driving advancements in various industries. This article will give you an in-depth understanding of machine learning, its applications, techniques, and potential. Whether you're a beginner or an experienced professional, this guide will help you master the world of machine learning. I will refer to machine learning as ML going forward. ML is a subset of artificial intelligence (AI) that enables computers to learn and make decisions without being explicitly programmed.

The Image of the M87 Black Hole Reconstructed with PRIMO - IOPscience


The exceptional resolution achieved by the EHT is made possible by an array of telescopes spanning the Earth and operating as a very long baseline interferometer (VLBI; Event Horizon Telescope Collaboration et al. 2019b, 2019c). Despite this global reach, the sparse interferometric coverage of the EHT array (especially during the 2017 observations that have been used for all of the publications to date) makes the already complex problem of interferometric image reconstruction particularly challenging. In such situations, special care is needed to assess the impact of imaging algorithms and sparse interferometric data on the final set of images that can be reconstructed from it. A cornerstone of the EHT data analysis strategy was the use of several independent analysis methods, each with different priorities, assumptions, and choices, to ensure that the EHT results were robust to these differences. The use of several general-purpose imaging algorithms, for example, was motivated by a desire to reconstruct an image that was consistent with the EHT data while remaining model-agnostic.