"Being good is easy, what is difficult is being just." "We need to defend the interests of those whom we've never met and never will." Note: This article is intended for a general audience to try and elucidate the complicated nature of unfairness in machine learning algorithms. As such, I have tried to explain concepts in an accessible way with minimal use of mathematics, in the hope that everyone can get something out of reading this. Supervised machine learning algorithms are inherently discriminatory. They are discriminatory in the sense that they use information embedded in the features of data to separate instances into distinct categories -- indeed, this is their designated purpose in life. This is reflected in the name for these algorithms which are often referred to as discriminative algorithms (splitting data into categories), in contrast to generative algorithms (generating data from a given category). When we use supervised machine learning, this "discrimination" is used as an aid to help us categorize our data into distinct categories within the data distribution, as illustrated below. Whilst this occurs when we apply discriminative algorithms -- such as support vector machines, forms of parametric regression (e.g.
Sepsis is one of the leading causes of death in intensive care units (ICUs). The strategy for treating sepsis involves the infusion of intravenous (IV) fluids and the administration of antibiotics. Determining the optimal quantity of IV fluids is a challenging problem due to the complexity of a patient's physiology. In this study, we develop a data-driven optimization solution that derives the optimal quantity of IV fluids for individual patients. The proposed method minimizes the probability of severe outcomes by controlling the prescribed quantity of IV fluids and utilizes human-in-the-loop artificial intelligence. We demonstrate the performance of our model on 1122 ICU patients with a sepsis diagnosis extracted from the MIMIC-III dataset. The results show that, on average, our model can reduce mortality by 22%. This study has the potential to help physicians synthesize optimal, patient-specific treatment strategies.
This thesis focuses on the research and development of the Hemodynamic Tissue Signature (HTS) method: an unsupervised machine learning approach to describe the vascular heterogeneity of glioblastomas by means of perfusion MRI analysis. The HTS builds on the concept of habitats. A habitat is defined as a sub-region of the lesion with a particular MRI profile describing a specific physiological behavior. The HTS method delineates four habitats within the glioblastoma: the High Angiogenic Tumor (HAT) habitat, the most perfused region of the enhancing tumor; the Low Angiogenic Tumor (LAT) habitat, the region of the enhancing tumor with a lower angiogenic profile; the potentially Infiltrated Peripheral Edema (IPE) habitat, the non-enhancing region adjacent to the tumor with elevated perfusion indexes; and the Vasogenic Peripheral Edema (VPE) habitat, the remaining edema of the lesion with the lowest perfusion profile. The results of this thesis have been published in ten scientific contributions, including top-ranked journals and conferences in the areas of Medical Informatics, Statistics and Probability, Radiology & Nuclear Medicine, Machine Learning and Data Mining, and Biomedical Engineering. An industrial patent registered in Spain (ES201431289A), Europe (EP3190542A1) and the United States (US20170287133A1) was also issued, summarizing the efforts of the thesis to generate tangible assets besides the academic revenue obtained from research publications. Finally, the methods, technologies and original ideas conceived in this thesis led to the foundation of ONCOANALYTICS CDX, a company framed within the business model of companion diagnostics for pharmaceutical compounds and conceived as a vehicle to facilitate the industrialization of the ONCOhabitats technology.
Friedrich, Sarah, Antes, Gerd, Behr, Sigrid, Binder, Harald, Brannath, Werner, Dumpert, Florian, Ickstadt, Katja, Kestler, Hans, Lederer, Johannes, Leitgöb, Heinz, Pauly, Markus, Steland, Ansgar, Wilhelm, Adalbert, Friede, Tim
The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at contributing to the current discussion by highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also deals with the equally necessary and meaningful extension of curricula in schools and universities.
Communicating effectively with customers is a challenge for many marketers, but especially in a context that is both pivotal to individual long-term financial well-being and difficult to understand: pensions. Around the world, participants are reluctant to consider their pension in advance, which leads to a lack of preparation for their retirement. In order to engage participants to obtain information on their expected pension benefits, personalizing the pension providers' email communication is a first and crucial step. We describe a machine learning approach to model email newsletters to fit participants' interests. The data for the modeling and analysis is collected from newsletters sent by a large Dutch pension provider and is divided into two parts. The first part comprises 2,228,000 customers, whereas the second part comprises the data of a pilot study, which took place in July 2018 with 465,711 participants. In both cases, our algorithm extracts features from continuous and categorical data using random forests, and then calculates node embeddings of the decision boundaries of the random forest. We illustrate the algorithm's effectiveness for the classification task, and how it can be used to perform data mining tasks. In order to confirm that the result is valid for more than one data set, we also illustrate the properties of our algorithm on benchmark data sets concerning churning. In the data sets considered, the proposed modeling demonstrates competitive performance with respect to other state-of-the-art approaches based on random forests, achieving the best Area Under the Curve (AUC) in the pension data set (0.948). For the descriptive part, the algorithm can identify customer segmentations that can be used by marketing departments to better target their communication towards their customers.
A 2016 report claims that annually upwards of 235 000 Canadians endure periods of homelessness, with approximately 35 000 individuals lacking a place to stay each night. Between 2005 and 2014, there was a downward trend in the total number of Canadians using shelters; however, the occupancy rates of shelters have been increasing. One factor accounting for this ongoing decrease in the number of homeless individuals paired with an increase in shelter occupancy is an increase in chronic homelessness. London's Homeless Prevention division identifies an individual as chronically homeless if they have spent 6 or more months (180 days) of the last year in a shelter, a definition based on the one outlined by the Canadian government's homelessness strategy directives. In addition to this trend, the demographics of homelessness are changing in Canada. In preceding decades, older, single males were over-represented in the homeless population; in contrast, the homeless population of today is increasingly diverse, with families, women, and youth comprising a greater fraction.
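The chronic-homelessness criterion quoted above (180 or more shelter days within the last year) can be written as a one-line check. This is only an illustrative sketch: the function name, the constant name and the input validation are assumptions, not part of the cited definition.

```python
# Threshold taken from the definition in the text: 6 or more months (180 days).
CHRONIC_THRESHOLD_DAYS = 180


def is_chronically_homeless(shelter_days_last_year: int) -> bool:
    """Return True if the shelter-day count meets the 180-day criterion.

    `shelter_days_last_year` is a hypothetical input: the number of days
    in the last year that a person spent in a shelter.
    """
    if not 0 <= shelter_days_last_year <= 366:
        raise ValueError("shelter days must lie between 0 and 366")
    return shelter_days_last_year >= CHRONIC_THRESHOLD_DAYS
```

For instance, 180 shelter days meets the criterion, while 179 does not.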
Being good is easy, what is difficult is being just. We need to defend the interests of those whom we've never met and never will. Note: This article is intended for a general audience to try and elucidate the complicated nature of unfairness in machine learning algorithms. As such, I have tried to explain concepts in an accessible way with minimal use of mathematics, in the hope that everyone can get something out of reading this. Supervised machine learning algorithms are inherently discriminatory. They are discriminatory in the sense that they use information embedded in the features of data to separate instances into distinct categories -- indeed, this is their designated purpose in life. This is reflected in the name for these algorithms which are often referred to as discriminative algorithms (splitting data into categories), in contrast to generative algorithms (generating data from a given category). When we use supervised machine learning, this "discrimination" is used as an aid to help us categorize our data into distinct categories within the data distribution, as illustrated below. Whilst this occurs when we apply discriminative algorithms -- such as support vector machines, forms of parametric regression (e.g. For example, using last week's weather data to try and predict the weather tomorrow has no moral valence attached to it.
Millions of mosques around the world suffer from problems such as poor ventilation and difficulty eliminating bacteria, especially during rush hours, when congestion leads to air pollution and the spread of bacteria, in addition to unpleasant odors and discomfort during prayer times; most mosques do not have enough windows to be ventilated well. This paper aims to solve these problems by building a model of smart mosque domes using weather features and outside temperatures. Machine learning algorithms, namely k-Nearest Neighbors (kNN) and Decision Tree (DT), were applied to predict whether the domes should be open or closed. The experiments in this paper were conducted on the Prophet's Mosque in Saudi Arabia, which contains twenty-seven manually operated domes. Both machine learning algorithms were tested and evaluated using different evaluation methods. Comparing the results, the DT algorithm achieved a higher accuracy (98%) than the kNN algorithm (95%). The results of this study are promising, and the proposed model could help mosques control their domes automatically.
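To make the kNN step concrete, the following is a minimal sketch of a k-nearest-neighbours classifier that predicts a dome state from weather readings. It is not the authors' implementation: the feature pair (outside temperature, relative humidity), the sample readings, the labels and the function names are all hypothetical, and the features are left unscaled for brevity.

```python
from collections import Counter
from math import dist

# Hypothetical training data, not taken from the paper:
# (outside temperature in deg C, relative humidity in %) -> dome state
TRAIN = [
    ((22.0, 30.0), "open"),
    ((24.0, 35.0), "open"),
    ((26.0, 40.0), "open"),
    ((41.0, 15.0), "closed"),
    ((43.0, 10.0), "closed"),
    ((45.0, 12.0), "closed"),
]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training rows by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out readings whose predicted state matches the label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)
```

Calling `knn_predict(TRAIN, (23.0, 32.0))` returns the label shared by the three nearest readings. An odd `k` avoids tied votes when there are only two classes, and real data would normally be rescaled so that no single feature dominates the distance.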
Dota 2 is a popular multiplayer online video game. As in many online games, players are mostly anonymous, being tied only to online accounts which can be readily obtained, sold and shared between multiple people. This makes it difficult to track or ban players who exhibit unwanted behavior online. In this paper, we present a machine learning approach to identify players based on a 'digital fingerprint' of how they play the game, rather than by account. We use data on mouse movements, in-game statistics and game strategy extracted from match replays and show that, for best results, all of these are necessary. We obtain a prediction accuracy of 95% for the problem of predicting whether two different matches were played by the same player.
Predicting a customer's propensity-to-pay at an early point in the revenue cycle can provide organisations with many opportunities to improve the customer experience, reduce hardship and reduce the risk of impaired cash flow and the occurrence of bad debt. With the advancements in data science, machine learning techniques can be used to build models that accurately predict a customer's propensity-to-pay. Creating effective machine learning models without access to large and detailed datasets presents some significant challenges. This paper presents a case study, conducted on a dataset from an energy organisation, to explore the uncertainty around the creation of machine learning models that are able to predict residential customers entering financial hardship, which then reduces their ability to pay energy bills. Incorrect predictions can result in inefficient resource allocation and vulnerable customers not being proactively identified. This study investigates machine learning models' ability to consider different contexts and estimate the uncertainty in the prediction. Seven models from four families of machine learning algorithms are investigated for their novel utilisation. A novel concept of applying a Bayesian Neural Network to the binary classification problem of propensity-to-pay energy bills is proposed and explored for deployment.