 Accuracy


GRAFFL: Gradient-free Federated Learning of a Bayesian Generative Model

arXiv.org Machine Learning

Federated learning platforms are gaining popularity. One of their major benefits is mitigating privacy risks, since algorithms can be trained without collecting or sharing data. While federated learning (much of it based on stochastic gradient algorithms) has shown great promise, many challenging problems remain in protecting privacy, especially during the update and exchange of gradients. This paper presents the first gradient-free federated learning framework, called GRAFFL, for learning a Bayesian generative model based on approximate Bayesian computation. Unlike conventional gradient-based federated learning algorithms, our framework does not require disassembling a model (i.e., into linear components), perturbing data, or encrypting data for aggregation to preserve privacy. Instead, the framework uses implicit information derived from each participating institution to learn posterior distributions of parameters. The implicit information consists of summary statistics derived from SuffiAE, a neural network developed in this study to create compressed and linearly separable representations, thereby protecting sensitive information from leakage. As a sufficient dimensionality reduction technique, SuffiAE is proved to provide sufficient summary statistics. We propose the GRAFFL-based Bayesian Gaussian mixture model to serve as a proof of concept of the framework. Using several datasets, we demonstrate the feasibility and usefulness of our model in terms of privacy protection and prediction performance (i.e., close to an ideal setting). The trained model, as a quasi-global model, can generate informative samples involving information from other institutions and enhance the data analysis of each institution.
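
For readers unfamiliar with approximate Bayesian computation, the following is a minimal sketch of generic rejection ABC, not the GRAFFL algorithm itself. It assumes a toy problem (inferring the mean of a unit-variance Gaussian) and uses the sample mean as the summary statistic in place of a learned representation such as SuffiAE. The names `abc_rejection`, `simulate_summary`, and `sample_prior`, as well as the tolerance and prior range, are hypothetical choices made for illustration. The point it shows is that only summary statistics, never gradients or raw records, drive the posterior estimate.

```python
import random
import statistics


def abc_rejection(observed_summary, simulate_summary, sample_prior,
                  tolerance, n_draws, rng):
    """Generic rejection ABC: keep prior draws whose simulated summary
    statistic lands within `tolerance` of the observed summary."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if abs(simulate_summary(theta, rng) - observed_summary) <= tolerance:
            accepted.append(theta)
    return accepted


def simulate_summary(theta, rng, n=50):
    # Toy simulator (assumption): n draws from N(theta, 1), summarized by the mean.
    return statistics.fmean([rng.gauss(theta, 1.0) for _ in range(n)])


def sample_prior(rng):
    # Flat prior on the unknown mean (assumption).
    return rng.uniform(-5.0, 5.0)


rng = random.Random(0)
# Stand-in for one institution's private data; only its summary is used below.
private_data = [rng.gauss(2.0, 1.0) for _ in range(50)]
observed_summary = statistics.fmean(private_data)

posterior_draws = abc_rejection(observed_summary, simulate_summary,
                                sample_prior, tolerance=0.1,
                                n_draws=5000, rng=rng)
posterior_mean = statistics.fmean(posterior_draws)
print(len(posterior_draws), round(posterior_mean, 2))
```

The accepted draws approximate the posterior over the mean; shrinking the tolerance tightens the approximation at the cost of fewer accepted samples.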


Self-Organizing Map assisted Deep Autoencoding Gaussian Mixture Model for Intrusion Detection

arXiv.org Machine Learning

In the information age, a secure and stable network environment is essential, and hence intrusion detection is critical for any network. In this paper, we propose a self-organizing map assisted deep autoencoding Gaussian mixture model (SOM-DAGMM) supplemented with well-preserved input space topology for more accurate network intrusion detection. The deep autoencoding Gaussian mixture model comprises a compression network and an estimation network, which can be trained jointly in an unsupervised manner. However, the code generated by the autoencoder is inept at preserving the topology of the input space, a limitation rooted in the bottleneck of the adopted deep structure. A self-organizing map is introduced to construct SOM-DAGMM and address this issue. The superiority of the proposed SOM-DAGMM is empirically demonstrated with extensive experiments conducted on two datasets. Experimental results show that SOM-DAGMM outperforms the state-of-the-art DAGMM on all tests, achieving up to a 15.58% improvement in F1 score with better stability.


How is Machine Learning being Developed to Prevent Phishing Attacks?

#artificialintelligence

Phishing attacks have been causing havoc for many years. They are becoming an increasing concern around the world. Whereas earlier there were only phishing email scams, we now see a massive uptick in the frequency of internal and lateral phishing attacks. As per Verizon's 2020 Data Breach Investigations Report (DBIR), 22 percent of breaches in 2019 involved phishing. Meanwhile, according to APWG's Phishing Activity Trends Report for Q1 2020, phishing attacks rose in prevalence to a level that hasn't been observed since 2016, with over 60,000 phishing sites being reported in March alone.


Automatic Player Identification in Dota 2

arXiv.org Artificial Intelligence

Dota 2 is a popular multiplayer online video game. As in many online games, players are mostly anonymous, being tied only to online accounts, which can be readily obtained, sold and shared between multiple people. This makes it difficult to track or ban players who exhibit unwanted behavior online. In this paper, we present a machine learning approach to identify players based on a 'digital fingerprint' of how they play the game, rather than by account. We use data on mouse movements, in-game statistics and game strategy extracted from match replays and show that, for best results, all of these are necessary. We obtain a prediction accuracy of 95% for the problem of predicting whether two different matches were played by the same player.


Teaching a Machine to Diagnose a Heart Disease; Beginning from digitizing scanned ECGs to detecting the Brugada Syndrome (BrS)

arXiv.org Artificial Intelligence

Medical diagnoses can shape and change the life of a person drastically. Therefore, it is always advisable to collect as much evidence as possible to be certain about the diagnosis. Unfortunately, in the case of the Brugada Syndrome (BrS), a rare and inherited heart disease, only one diagnostic criterion exists, namely, a typical pattern in the Electrocardiogram (ECG). In the following treatise, we question whether the investigation of ECG strips by means of machine learning methods improves the detection of BrS positive cases and hence the diagnostic process. We propose a pipeline that reads in scanned images of ECGs and transforms the captured signals into digital time-voltage data after several processing steps. Then, we present a long short-term memory (LSTM) classifier that is built on the previously extracted data and that makes the diagnosis. The proposed pipeline distinguishes between three major types of ECG images and recreates each recorded lead signal. Features and quality are retained during the digitization of the data, although some encountered issues are not fully removed (Part I). Nevertheless, the results of the aforesaid program are suitable for further investigation of the ECG by a computational method such as the proposed classifier, which proves the concept and could be the architectural basis for future research (Part II). This thesis is divided into two parts, as they belong to the same process but are conceptually different. It is hoped that this work builds a new foundation for computational investigations in the case of the BrS and its diagnosis.


Propensity-to-Pay: Machine Learning for Estimating Prediction Uncertainty

arXiv.org Artificial Intelligence

Predicting a customer's propensity-to-pay at an early point in the revenue cycle can provide organisations many opportunities to improve the customer experience, reduce hardship and reduce the risk of impaired cash flow and occurrence of bad debt. With the advancements in data science, machine learning techniques can be used to build models to accurately predict a customer's propensity-to-pay. Creating effective machine learning models without access to large and detailed datasets presents some significant challenges. This paper presents a case study, conducted on a dataset from an energy organisation, to explore the uncertainty around the creation of machine learning models that are able to predict residential customers entering financial hardship, which then reduces their ability to pay energy bills. Incorrect predictions can result in inefficient resource allocation and vulnerable customers not being proactively identified. This study investigates machine learning models' ability to consider different contexts and estimate the uncertainty in the prediction. Seven models from four families of machine learning algorithms are investigated for their novel utilisation. A novel concept of applying a Bayesian Neural Network to the binary classification problem of propensity-to-pay energy bills is proposed and explored for deployment.


Semi-supervised Learning with the EM Algorithm: A Comparative Study between Unstructured and Structured Prediction

arXiv.org Machine Learning

Semi-supervised learning aims to learn prediction models from both labeled and unlabeled samples. There has been extensive research in this area. Among existing work, generative mixture models with Expectation-Maximization (EM) are a popular method due to their clear statistical properties. However, existing literature on EM-based semi-supervised learning largely focuses on unstructured prediction, assuming that samples are independent and identically distributed. Studies on EM-based semi-supervised approaches in structured prediction are limited. This paper aims to fill the gap through a comparative study between unstructured and structured methods in EM-based semi-supervised learning. Specifically, we compare their theoretical properties and find that both methods can be considered a generalization of self-training with soft class assignment of unlabeled samples, but the structured method additionally considers structural constraints in soft class assignment. We conducted a case study on real-world flood mapping datasets to compare the two methods. Results show that structured EM is more robust to class confusion caused by noise and obstacles in features in the context of the flood mapping application.
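
To make the idea of "soft class assignment of unlabeled samples" concrete, here is a minimal sketch of the unstructured (i.i.d.) case only, assuming a one-dimensional, two-class Gaussian mixture. It is not the paper's implementation: the function name `semi_supervised_em`, the toy data, and the variance floor are hypothetical, and the structured variant (which would add constraints between neighboring samples) is not shown. Labeled samples keep hard assignments, while unlabeled samples receive posterior class probabilities in each E-step.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, n_iter=50, var_floor=1e-3):
    """labeled: list of (x, y) with y in {0, 1}; unlabeled: list of x.
    Assumes at least one labeled sample per class."""
    xs = [x for x, _ in labeled] + list(unlabeled)

    # Initialize class parameters from the labeled samples alone.
    params = []
    for k in (0, 1):
        xk = [x for x, y in labeled if y == k]
        mean = sum(xk) / len(xk)
        var = max(sum((x - mean) ** 2 for x in xk) / len(xk), var_floor)
        params.append((mean, var))
    priors = [0.5, 0.5]
    soft = []

    for _ in range(n_iter):
        # E-step: soft class assignment for unlabeled samples only.
        soft = []
        for x in unlabeled:
            w = [priors[k] * gaussian_pdf(x, *params[k]) for k in (0, 1)]
            total = sum(w)
            soft.append([wk / total for wk in w] if total > 0 else [0.5, 0.5])

        # M-step: labeled samples count with weight 1 for their own class,
        # unlabeled samples count with their soft weights.
        new_params, new_priors = [], []
        for k in (0, 1):
            weights = [1.0 if y == k else 0.0 for _, y in labeled]
            weights += [r[k] for r in soft]
            mass = sum(weights)
            mean = sum(w * x for w, x in zip(weights, xs)) / mass
            var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / mass
            new_params.append((mean, max(var, var_floor)))
            new_priors.append(mass / len(xs))
        params, priors = new_params, new_priors

    return params, priors, soft


# Made-up toy data: two well-separated classes plus one ambiguous point.
labeled = [(0.0, 0), (0.4, 0), (4.0, 1), (4.4, 1)]
unlabeled = [-0.3, 0.1, 0.5, 3.7, 4.1, 4.6, 2.0]
params, priors, soft = semi_supervised_em(labeled, unlabeled)
print(params, priors)
```

Replacing the soft weights with a hard argmax in the E-step would recover plain self-training, which is the sense in which EM generalizes it.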


Feature Selection from High-Dimensional Data with Very Low Sample Size: A Cautionary Tale

arXiv.org Machine Learning

In classification problems, the purpose of feature selection is to identify a small, highly discriminative subset of the original feature set. In many applications, the dataset may have thousands of features and only a few dozen samples (sometimes termed 'wide'). This study is a cautionary tale demonstrating why feature selection in such cases may lead to undesirable results. To highlight the sample size issue, we derive the required sample size for declaring two features different. Using an example, we illustrate the heavy dependency between feature set and classifier, which calls classifier-agnostic feature selection methods into question. However, the choice of a good selector-classifier pair is hampered by the low correlation between estimated and true error rate, as illustrated by another example. While previous studies raising similar issues validate their message with mostly synthetic data, here we carried out an experiment with 20 real datasets. We created an exaggerated scenario whereby we cut a very small portion of the data (10 instances per class) for feature selection and used the rest of the data for testing. The results reinforce the caution and suggest that it may be better to refrain from feature selection from very wide datasets rather than return misleading output to the user.


Can a Selfie Help Detect Coronary Artery Disease?

#artificialintelligence

The authors of an interesting new paper suggest that a new algorithm may make it possible to assist in the diagnosis of coronary artery disease (CAD) with a facial photograph. The paper, published in the European Journal of Cardiology, was a multicenter, cross-sectional study of patients undergoing coronary angiography or CT angiography at nine sites in China. The purpose of evaluating the scans was to train and validate a deep convolutional neural network for CAD detection (at least one 50% stenosis) from facial photographs. The analysis included 5,796 consecutively enrolled patients who were randomly assigned to either training (n = 5,216) or validation (n = 580) groups for the development of the algorithm. The researchers then enrolled 1,013 patients into the algorithm test group and calculated sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) using radiology-based diagnosis as the standard.
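
As a reminder of what the three reported metrics measure, the sketch below computes sensitivity, specificity, and AUC from reference labels and model scores. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not taken from the study; the helper names are hypothetical. AUC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties count as half).

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Pairwise (Mann-Whitney) form of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = disease present per the reference standard.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(sens, spec, area)  # 0.75 0.75 0.9375
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.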


"Improving" prediction of human behavior using behavior modification

arXiv.org Machine Learning

The fields of statistics and machine learning design algorithms, models, and approaches to improve prediction. Larger and richer behavioral data increase predictive power, as evident from recent advances in behavioral prediction technology. Large internet platforms that collect behavioral big data predict user behavior for internal purposes and for third parties (advertisers, insurers, security forces, political consulting firms) who utilize the predictions for personalization, targeting and other decision-making. While standard data collection and modeling efforts are directed at improving predicted values, internet platforms can minimize prediction error by "pushing" users' actions towards their predicted values using behavior modification techniques. The better the platform can make users conform to their predicted outcomes, the more it can boast its predictive accuracy and ability to induce behavior change. Hence, platforms are strongly incentivized to "make predictions true". This strategy is absent from the ML and statistics literature. Investigating its properties requires incorporating causal notation into the correlation-based predictive environment, an integration currently missing. To tackle this void, we integrate Pearl's causal do(.) operator into the predictive framework. We then decompose the expected prediction error given behavior modification, and identify the components impacting predictive power. Our derivation elucidates the implications of such behavior modification to data scientists, platforms, their clients, and the humans whose behavior is manipulated. Behavior modification can make users' behavior more predictable and even more homogeneous; yet this apparent predictability might not generalize when clients use predictions in practice. Outcomes pushed towards their predictions can be at odds with clients' intentions, and harmful to manipulated users.