Goto

Collaborating Authors

 Performance Analysis


Why resampling outperforms reweighting for correcting sampling bias

arXiv.org Machine Learning

A data set sampled from a certain population is biased if the subgroups of the population are sampled at proportions that are significantly different from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for potential biases. We consider two commonly-used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function. Though statistically equivalent, it has been observed that reweighting outperforms resampling when combined with stochastic gradient algorithms. By analyzing illustrative examples, we explain the reason behind this phenomenon using tools from dynamical stability and stochastic asymptotics. We also present experiments from regression, classification, and off-policy prediction to demonstrate that this is a general phenomenon. We argue that it is imperative to consider the objective function design and the optimization algorithm together while addressing the sampling bias.


Balancing thermal comfort datasets: We GAN, but should we?

arXiv.org Machine Learning

Thermal comfort assessment for the built environment has become more available to analysts and researchers due to the proliferation of sensors and subjective feedback methods. These data can be used for modeling comfort behavior to support design and operations towards energy efficiency and well-being. By nature, occupant subjective feedback is imbalanced as indoor conditions are designed for comfort, and responses indicating otherwise are less common. This situation creates a scenario for the machine learning workflow where class balancing as a pre-processing step might be valuable for developing predictive thermal comfort classification models with high-performance. This paper investigates the various thermal comfort dataset class balancing techniques from the literature and proposes a modified conditional Generative Adversarial Network (GAN), $\texttt{comfortGAN}$, to address this imbalance scenario. These approaches are applied to three publicly available datasets, ranging from 30 and 67 participants to a global collection of thermal comfort datasets, with 1,474; 2,067; and 66,397 data points, respectively. This work finds that a classification model trained on a balanced dataset, comprised of real and generated samples from $\texttt{comfortGAN}$, has higher performance (increase between 4% and 17% in classification accuracy) than other augmentation methods tested. However, when classes representing discomfort are merged and reduced to three, better imbalanced performance is expected, and the additional increase in performance by $\texttt{comfortGAN}$ shrinks to 1-2%. These results illustrate that class balancing for thermal comfort modeling is beneficial using advanced techniques such as GANs, but its value is diminished in certain scenarios. A discussion is provided to assist potential users in determining which scenarios this process is useful and which method works best.


Deep Learning-Based Automatic Detection of Poorly Positioned Mammograms to Minimize Patient Return Visits for Repeat Imaging: A Real-World Application

arXiv.org Artificial Intelligence

Screening mammograms are a routine imaging exam performed to detect breast cancer in its early stages to reduce morbidity and mortality attributed to this disease. In order to maximize the efficacy of breast cancer screening programs, proper mammographic positioning is paramount. Proper positioning ensures adequate visualization of breast tissue and is necessary for effective breast cancer detection. Therefore, breast-imaging radiologists must assess each mammogram for the adequacy of positioning before providing a final interpretation of the examination; this often necessitates return patient visits for additional imaging. In this paper, we propose a deep learning-algorithm method that mimics and automates this decision-making process to identify poorly positioned mammograms. Our objective for this algorithm is to assist mammography technologists in recognizing inadequately positioned mammograms real-time, improve the quality of mammographic positioning and performance, and ultimately reducing repeat visits for patients with initially inadequate imaging. The proposed model showed a true positive rate for detecting correct positioning of 91.35% in the mediolateral oblique view and 95.11% in the craniocaudal view. In addition to these results, we also present an automatically generated report which can aid the mammography technologist in taking corrective measures during the patient visit.


RENT -- Repeated Elastic Net Technique for Feature Selection

arXiv.org Machine Learning

In this study we present the RENT feature selection method for binary classification and regression problems. We compare the performance of RENT to a number of other state-of-the-art feature selection methods on eight datasets (six for binary classification and two for regression) to illustrate RENT's performance with regard to prediction and reduction of total number of features. At its core RENT trains an ensemble of unique models using regularized elastic net to select features. Each model in the ensemble is trained with a unique and randomly selected subset from the full training data. From these models one can acquire weight distributions for each feature that contain rich information on the stability of feature selection and from which several adjustable classification criteria may be defined. Moreover, we acquire distributions of class predictions for each sample across many models in the ensemble. Analysis of these distributions may provide useful insight into which samples are more difficult to classify correctly than others. Overall, results from the tested datasets show that RENT not only can compete on-par with the best performing feature selection methods in this study, but also provides valuable insights into the stability of feature selection and sample classification.


Programming Fairness in Algorithms

#artificialintelligence

"Being good is easy, what is difficult is being just." "We need to defend the interests of those whom we've never met and never will." Note: This article is intended for a general audience to try and elucidate the complicated nature of unfairness in machine learning algorithms. As such, I have tried to explain concepts in an accessible way with minimal use of mathematics, in the hope that everyone can get something out of reading this. Supervised machine learning algorithms are inherently discriminatory. They are discriminatory in the sense that they use information embedded in the features of data to separate instances into distinct categories -- indeed, this is their designated purpose in life. This is reflected in the name for these algorithms which are often referred to as discriminative algorithms (splitting data into categories), in contrast to generative algorithms (generating data from a given category). When we use supervised machine learning, this "discrimination" is used as an aid to help us categorize our data into distinct categories within the data distribution, as illustrated below. Whilst this occurs when we apply discriminative algorithms -- such as support vector machines, forms of parametric regression (e.g.


Will AutoML software replace Data Scientists?

#artificialintelligence

AutoML is a generic expression to indicate pieces of software that perform Machine Learning tasks automatically. Such pieces of software can be Python libraries like Auto-Sklearn or software programs like Data Robot. AutoML pieces of software replace all the boring steps that take more time to a Data Scientist's work. They actually make all the combinations of the several parameters of a pipeline (e.g. the blank filling values, scaling algorithm, model type, model hyperparameters) and select the best combination that maximizes some performance metrics (like RMSE or Area under the ROC Curve) in k-fold cross-validation using some search algorithm (like Grid or Random Search). They can really simplify the life of somebody that has to create a model from scratch and sometimes they explore combinations and scenarios that a Data Scientist may not have thought of.


Fairness in Semi-supervised Learning: Unlabeled Data Help to Reduce Discrimination

arXiv.org Machine Learning

A growing specter in the rise of machine learning is whether the decisions made by machine learning models are fair. While research is already underway to formalize a machine-learning concept of fairness and to design frameworks for building fair models with sacrifice in accuracy, most are geared toward either supervised or unsupervised learning. Yet two observations inspired us to wonder whether semi-supervised learning might be useful to solve discrimination problems. First, previous study showed that increasing the size of the training set may lead to a better trade-off between fairness and accuracy. Second, the most powerful models today require an enormous of data to train which, in practical terms, is likely possible from a combination of labeled and unlabeled data. Hence, in this paper, we present a framework of fair semi-supervised learning in the pre-processing phase, including pseudo labeling to predict labels for unlabeled data, a re-sampling method to obtain multiple fair datasets and lastly, ensemble learning to improve accuracy and decrease discrimination. A theoretical decomposition analysis of bias, variance and noise highlights the different sources of discrimination and the impact they have on fairness in semi-supervised learning. A set of experiments on real-world and synthetic datasets show that our method is able to use unlabeled data to achieve a better trade-off between accuracy and discrimination.


AI, Protests, and Justice

#artificialintelligence

Editor's Note: The use of face recognition technology in policing has been a long-standing subject of concern, even more-so now after the murder of George Floyd and the demonstrations that have followed. In this article, Mike Loukides, VP of Content Strategy at O'Reilly Media, reviews how companies and cities have addressed these concerns, as well as ways in which individuals can mitigate face recognition technology or even use it to increase accountability. We'd love to hear from you about what you think about this piece. Largely on the impetus of the Black Lives Matter movement, the public's response to the murder of George Floyd, and the subsequent demonstrations, we've seen increased concern about the use of facial identification in policing. First, in a highly publicized wave of announcements, IBM, Microsoft, and Amazon have announced that they will not sell face recognition technology to police forces.


Self-Weighted Robust LDA for Multiclass Classification with Edge Classes

arXiv.org Machine Learning

Linear discriminant analysis (LDA) is a popular technique to learn the most discriminative features for multi-class classification. A vast majority of existing LDA algorithms are prone to be dominated by the class with very large deviation from the others, i.e., edge class, which occurs frequently in multi-class classification. First, the existence of edge classes often makes the total mean biased in the calculation of between-class scatter matrix. Second, the exploitation of l2-norm based between-class distance criterion magnifies the extremely large distance corresponding to edge class. In this regard, a novel self-weighted robust LDA with l21-norm based pairwise between-class distance criterion, called SWRLDA, is proposed for multi-class classification especially with edge classes. SWRLDA can automatically avoid the optimal mean calculation and simultaneously learn adaptive weights for each class pair without setting any additional parameter. An efficient re-weighted algorithm is exploited to derive the global optimum of the challenging l21-norm maximization problem. The proposed SWRLDA is easy to implement, and converges fast in practice. Extensive experiments demonstrate that SWRLDA performs favorably against other compared methods on both synthetic and real-world datasets, while presenting superior computational efficiency in comparison with other techniques.


BreachRadar: Automatic Detection of Points-of-Compromise

arXiv.org Machine Learning

Bank transaction fraud results in over $13B annual losses for banks, merchants, and card holders worldwide. Much of this fraud starts with a Point-of-Compromise (a data breach or a skimming operation) where credit and debit card digital information is stolen, resold, and later used to perform fraud. We introduce this problem and present an automatic Points-of-Compromise (POC) detection procedure. BreachRadar is a distributed alternating algorithm that assigns a probability of being compromised to the different possible locations. We implement this method using Apache Spark and show its linear scalability in the number of machines and transactions. BreachRadar is applied to two datasets with billions of real transaction records and fraud labels where we provide multiple examples of real Points-of-Compromise we are able to detect. We further show the effectiveness of our method when injecting Points-of-Compromise in one of these datasets, simultaneously achieving over 90% precision and recall when only 10% of the cards have been victims of fraud.