 Performance Analysis


Preservation of Anomalous Subgroups On Machine Learning Transformed Data

arXiv.org Machine Learning

In this paper, we investigate the effect of machine learning based anonymization on anomalous subgroup preservation. In particular, we train a binary classifier to discover the most anomalous subgroup in a dataset by maximizing the bias between the group's predicted odds ratio from the model and the observed odds ratio from the data. We then perform anonymization using a variational autoencoder (VAE) to synthesize an entirely new dataset that would ideally be drawn from the distribution of the original data. We repeat the anomalous subgroup discovery task on the new data and compare the result to what was identified pre-anonymization. We evaluated our approach using publicly available datasets from the financial industry. Our evaluation confirmed that the approach was able to produce synthetic datasets that preserved a high level of subgroup differentiation as identified initially in the original dataset. This distinction was maintained even though the records in the synthetic dataset were distinctly different from those in the original dataset. Finally, we packaged the above end-to-end process into what we call the Utility Guaranteed Deep Privacy (UGDP) system. UGDP can be easily extended to onboard alternative generative approaches such as GANs to synthesize tabular data.


Missing Features Reconstruction and Its Impact on Classification Accuracy

arXiv.org Machine Learning

In real-world applications, we can encounter situations where a well-trained model has to be used to predict from a damaged dataset. The damage caused by missing or corrupted values can be either on the level of individual instances or on the level of entire features. Both situations have a negative impact on the usability of the model on such a dataset. This paper focuses on the scenario where entire features are missing, which can be understood as a specific case of transfer learning. Our aim is to experimentally research the influence of various imputation methods on the performance of several classification models. The imputation impact is researched on a combination of traditional methods such as k-NN, linear regression, and MICE compared to modern imputation methods such as multi-layer perceptron (MLP) and gradient boosted trees (XGBT). For linear regression, MLP, and XGBT we also propose two approaches to using them for multiple features imputation. The experiments were performed on both real-world and artificial datasets with continuous features where different numbers of features, varying from one feature to 50%, were missing. The results show that MICE and linear regression are generally good imputers regardless of the conditions. On the other hand, the performance of MLP and XGBT is strongly dataset dependent. Their performance is the best in some cases, but more often they perform worse than MICE or linear regression.
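As a rough illustration of the "entire feature is missing" scenario, the sketch below reconstructs one missing continuous feature from one observed feature using ordinary least squares. It is a minimal single-predictor example under stated assumptions, not the paper's experimental setup; the function names `fit_line` and `impute_feature` and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_feature(train_observed, train_target, damaged_observed):
    """Fit on intact training data, then predict the feature that is
    entirely missing from the damaged dataset."""
    slope, intercept = fit_line(train_observed, train_target)
    return [slope * v + intercept for v in damaged_observed]


# Toy data (assumed): the target feature equals 2 * observed + 1.
train_observed = [1.0, 2.0, 3.0, 4.0]
train_target = [3.0, 5.0, 7.0, 9.0]
imputed = impute_feature(train_observed, train_target, [5.0, 10.0])
```

The imputed column can then be passed to the already trained classifier in place of the missing feature.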


Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

arXiv.org Machine Learning

Stochastic Gradient Descent has been widely studied with classification accuracy as a performance measure. However, these stochastic algorithms cannot be directly used with non-decomposable pairwise performance measures such as the area under the ROC curve (AUC), a common performance metric when the classes are imbalanced. Several algorithms have been proposed for optimizing AUC as a performance metric, one of the most recent being a stochastic proximal gradient algorithm (SPAM). The downside of stochastic methods is that they suffer from high variance, which leads to slower convergence. To combat this issue, several variance reduced methods have been proposed with faster convergence guarantees than vanilla stochastic gradient descent. Again, these variance reduced methods are not directly applicable when non-decomposable performance measures are used. In this paper, we develop a Variance Reduced Stochastic Proximal algorithm for AUC Maximization (VRSPAM) and perform a theoretical analysis as well as an empirical analysis to show that our algorithm converges faster than SPAM, which is the previous state of the art for the AUC maximization problem.
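To see why AUC is called a pairwise, non-decomposable measure, it can be written as the fraction of (positive, negative) example pairs that the scores rank correctly. The sketch below computes it that way; it only illustrates the metric and is not an implementation of SPAM or VRSPAM. The function name `pairwise_auc` and the toy scores are assumptions.

```python
from itertools import product


def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs where the positive
    example gets the higher score; tied pairs count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p, n in product(pos, neg):
        if p > n:
            correct += 1.0
        elif p == n:
            correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example (assumed data): 3 of the 4 positive/negative pairs are ordered correctly.
auc = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Because every term involves two examples, the loss cannot be written as a sum over individual examples, which is why plain per-example SGD does not apply directly.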


AutoIDS: Auto-encoder Based Method for Intrusion Detection System

arXiv.org Machine Learning

An Intrusion Detection System (IDS) is one of the most effective solutions for providing primary security services. IDSs generally work based on attack signatures or by detecting anomalies. In this paper, we present AutoIDS, a novel yet efficient solution for IDS, based on a semi-supervised machine learning technique. AutoIDS can distinguish abnormal packet flows from normal ones by cascading two efficient detectors. These detectors are two encoder-decoder neural networks that are forced to provide a compressed and a sparse representation of the normal flows. In the test phase, if these neural networks fail to provide a compressed or sparse representation of an incoming packet flow, the flow does not comply with normal traffic and is therefore considered an intrusion. To lower the computational cost while preserving accuracy, a large number of flows are processed by the first detector only. The second detector is used only for difficult samples about which the first detector is not confident. We have evaluated AutoIDS on NSL-KDD, a widely used and well-known benchmark dataset. The accuracy of AutoIDS is 90.17%, showing its superiority compared to the other state-of-the-art methods. Nowadays, providing security services in different computer networks is an issue of paramount significance. The principal security services required by almost all communication networks, irrespective of their types, are confidentiality, authenticity, non-repudiation, integrity, and availability.
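The cascading idea in the abstract above can be sketched as a two-stage decision rule: the first detector settles the flows it is confident about, and only the uncertain ones reach the second detector. The abstract does not give the scoring or thresholds, so everything concrete here is an assumption: the use of a scalar error score per detector, the `low`/`high` confidence band, and the name `cascade_detect` are all hypothetical.

```python
def cascade_detect(flow, first_error, second_error, low, high, second_threshold):
    """Two-stage cascade (hypothetical scoring, not the AutoIDS networks).

    first_error / second_error: callables returning an error score for a flow,
    where a larger score means the detector represents the flow worse.
    low / high: assumed confidence band for the first detector.
    """
    e1 = first_error(flow)
    if e1 <= low:
        return "normal"       # first detector is confident: fits normal traffic
    if e1 >= high:
        return "intrusion"    # first detector is confident: does not fit
    # Uncertain region: defer to the second, more expensive detector.
    return "intrusion" if second_error(flow) > second_threshold else "normal"


# Stand-in detectors (assumed) that read precomputed scores from a dict.
first = lambda flow: flow["e1"]
second = lambda flow: flow["e2"]

easy = cascade_detect({"e1": 0.1, "e2": 9.0}, first, second, 0.2, 0.8, 0.5)
hard = cascade_detect({"e1": 0.5, "e2": 0.7}, first, second, 0.2, 0.8, 0.5)
```

In this sketch the second detector is never called for the easy flow, which is where the computational saving described in the abstract would come from.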


The 5 Classification Evaluation metrics every Data Scientist must know

#artificialintelligence

What do we want to optimize for? Most businesses fail to answer this simple question. Every business problem is a little different, and it should be optimized differently. We have all created classification models. Much of the time, we evaluate our models on accuracy.


The 6 Metrics You Need to Optimize for Performance in Machine Learning - Exxact

#artificialintelligence

There are many metrics for measuring the performance of your model, depending on the type of machine learning you are looking to conduct. In this article, we take a look at performance measures for classification and regression models and discuss which is better optimized. Sometimes the metric to look at will vary according to the problem that is initially being solved. The True Positive Rate, also called Recall, is the go-to performance measure in binary/non-binary classification problems. Most, if not all, of the time, we are only interested in correctly predicting one class.
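As a small worked example of the True Positive Rate mentioned above, recall is the number of true positives divided by the number of actual positives, TP / (TP + FN). The helper name `recall` and the toy labels below are assumptions made for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """True Positive Rate: TP / (TP + FN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no examples of the positive class in y_true")
    return tp / (tp + fn)


# Toy labels (assumed): 3 actual positives, 2 of them predicted correctly.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1]
tpr = recall(y_true, y_pred)
```

Note that the false positive on the last example does not affect recall, since the metric only looks at the class of interest.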


Graph Domain Adaptation with Localized Graph Signal Representations

arXiv.org Machine Learning

Yusuf Yiğit Pilavcı, Eylem Tuğçe Güneyi, Cemil Cengiz and Elif Vural. arXiv:1911.02883v1.
Abstract: In this paper we propose a domain adaptation algorithm designed for graph domains. Given a source graph with many labeled nodes and a target graph with few or no labeled nodes, we aim to estimate the target labels by making use of the similarity between the characteristics of the variation of the label functions on the two graphs. Our assumption about the source and the target domains is that the local behaviour of the label function, such as its spread and speed of variation on the graph, bears resemblance between the two graphs. We estimate the unknown target labels by solving an optimization problem where the label information is transferred from the source graph to the target graph based on the prior that the projections of the label functions onto localized graph bases be similar between the source and the target graphs. In order to efficiently capture the local variation of the label functions on the graphs, spectral graph wavelets are used as the graph bases. Experimentation on various data sets shows that the proposed method yields quite satisfactory classification accuracy compared to reference domain adaptation methods.
Keywords: Domain adaptation, spectral graph theory, graph signal processing, spectral graph wavelets, graph Laplacian
1 Introduction: A common assumption in machine learning is that the training and the test data are sampled from the same distribution. Domain adaptation methods aim to provide solutions to machine learning problems by dealing with this distribution discrepancy. In domain adaptation, a source domain and a target domain are considered where the label information is mostly available for the data samples in the source domain, and few or none of the class labels are known in the target domain. The purpose is then to improve the learning performance in the target domain by making use ... A variety of approaches have been proposed so far for the domain adaptation problem. Some methods are based on reweighing the samples for removing the sample selection bias [1, 2]. Another common solution is to align the source and the target domains through feature space mappings.
Y. Y. Pilavcı is with the GIPSA Lab at Université Grenoble Alpes, Grenoble. C. Cengiz is with the Dept. of Computer Science and Engineering at Koç University, Istanbul. Most of this work was performed while the authors were at METU.


Fair Meta-Learning: Learning How to Learn Fairly

arXiv.org Machine Learning

Data sets for fairness-relevant tasks can lack examples or be biased according to a specific label in a sensitive attribute. We demonstrate the usefulness of weight-based meta-learning approaches in such situations. For models that can be trained through gradient descent, we demonstrate that there are some parameter configurations that allow models that are both fair and accurate to be optimized in a small number of gradient steps and with minimal data. To learn such weight sets, we adapt the popular MAML algorithm to Fair-MAML by the inclusion of a fairness regularization term. In practice, Fair-MAML allows practitioners to train fair machine learning models from only a few examples when data from related tasks is available. We empirically exhibit the value of this technique by comparing to relevant baselines.


Researchers develop machine learning-based detector that stops lateral phishing attacks - Help Net Security

#artificialintelligence

Lateral phishing attacks – scams targeting users from compromised email accounts within an organization – are becoming an increasing concern in the U.S. Whereas in the past attackers would send phishing scams from email accounts external to an organization, recently there has been an explosion of email-borne scams in which attackers compromise email accounts within organizations and then use those accounts to send internal phishing emails to fellow employees, the kind of attack known as lateral phishing. And when a phishing email comes from an internal account, the vast majority of email security systems can't stop it. Existing security systems largely detect cyber attacks that come from the outside, relying on signals like IP and domain reputation, which are ineffective when the email comes from an internal source. Lateral phishing attacks are also costly. FBI data shows that these cyberattacks caused more than $12 billion in losses between 2013 and 2018.