Goto

Collaborating Authors

 Accuracy


SHAPr: An Efficient and Versatile Membership Privacy Risk Metric for Machine Learning

arXiv.org Artificial Intelligence

Data used to train machine learning (ML) models can be sensitive. Membership inference attacks (MIAs), attempting to determine whether a particular data record was used to train an ML model, risk violating membership privacy. ML model builders need a principled definition of a metric to quantify the membership privacy risk of (a) individual training data records, (b) computed independently of specific MIAs, (c) which assesses susceptibility to different MIAs, (d) can be used for different applications, and (e) efficiently. None of the prior membership privacy risk metrics simultaneously meet all these requirements. We present SHAPr, a membership privacy metric based on Shapley values which is a leave-one-out (LOO) technique, originally intended to measure the contribution of a training data record on model utility. We conjecture that contribution to model utility can act as a proxy for memorization, and hence represent membership privacy risk. Using ten benchmark datasets, we show that SHAPr is indeed effective in estimating susceptibility of training data records to MIAs. We also show that, unlike prior work, SHAPr is significantly better in estimating susceptibility to newer, and more effective MIA. We apply SHAPr to evaluate the efficacy of several defenses against MIAs: using regularization and removing high risk training data records. Moreover, SHAPr is versatile: it can be used for estimating vulnerability of different subgroups to MIAs, and inherits applications of Shapley values (e.g., data valuation). We show that SHAPr has an acceptable computational cost (compared to naive LOO), varying from a few minutes for the smallest dataset to ~92 minutes for the largest dataset.


Ensemble of Pre-Trained Neural Networks for Segmentation and Quality Detection of Transmission Electron Microscopy Images

arXiv.org Artificial Intelligence

Automated analysis of electron microscopy datasets poses multiple challenges, such as limitation in the size of the training dataset, variation in data distribution induced by variation in sample quality and experiment conditions, etc. It is crucial for the trained model to continue to provide acceptable segmentation/classification performance on new data, and quantify the uncertainty associated with its predictions. Among the broad applications of machine learning, various approaches have been adopted to quantify uncertainty, such as Bayesian modeling, Monte Carlo dropout, ensembles, etc. With the aim of addressing the challenges specific to the data domain of electron microscopy, two different types of ensembles of pre-trained neural networks were implemented in this work. The ensembles performed semantic segmentation of ice crystal within a two-phase mixture, thereby tracking its phase transformation to water. The first ensemble (EA) is composed of U-net style networks having different underlying architectures, whereas the second series of ensembles (ER-i) are composed of randomly initialized U-net style networks, wherein each base learner has the same underlying architecture 'i'. The encoders of the base learners were pre-trained on the Imagenet dataset. The performance of EA and ER were evaluated on three different metrics: accuracy, calibration, and uncertainty. It is seen that EA exhibits a greater classification accuracy and is better calibrated, as compared to ER. While the uncertainty quantification of these two types of ensembles are comparable, the uncertainty scores exhibited by ER were found to be dependent on the specific architecture of its base member ('i') and not consistently better than EA. Thus, the challenges posed for the analysis of electron microscopy datasets appear to be better addressed by an ensemble design like EA, as compared to an ensemble design like ER.


Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation

arXiv.org Artificial Intelligence

Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. Target identification can be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that existing influence estimators' poor practical performance often derives from their over-reliance on training instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original estimators at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's effectiveness on backdoor and poisoning attacks across various data domains, including text, vision, and speech, as well as against a gray-box, adaptive attacker that specifically optimizes the adversarial instances to evade our method. Our source code is available at https://github.com/ZaydH/target_identification.


Beyond Random Split for Assessing Statistical Model Performance

arXiv.org Artificial Intelligence

Even though a train/test split of the dataset randomly performed is a common practice, could not always be the best approach for estimating performance generalization under some scenarios. The fact is that the usual machine learning methodology can sometimes overestimate the generalization error when a dataset is not representative or when rare and elusive examples are a fundamental aspect of the detection problem. In the present work, we analyze strategies based on the predictors' variability to split in training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness and provide a more accurate estimation about the generalization error when the dataset is not representative. Two baseline classifiers based on decision trees were used for testing the four splitting strategies considered. Both classifiers were applied on CTU19 a low-representative dataset for a network security detection problem. Preliminary results showed the importance of applying the three alternative strategies to the Monte Carlo splitting strategy in order to get a more accurate error estimation on different but feasible scenarios.


Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes

arXiv.org Artificial Intelligence

Fraud detection is a challenging task due to the changing nature of fraud patterns over time and the limited availability of fraud examples to learn such sophisticated patterns. Thus, fraud detection with the aid of smart versions of machine learning (ML) tools is essential to assure safety. Fraud detection is a primary ML classification task; however, the optimum performance of the corresponding ML tool relies on the usage of the best hyperparameter values. Moreover, classification under imbalanced classes is quite challenging as it causes poor performance in minority classes, which most ML classification techniques ignore. Thus, we investigate four state-of-the-art ML techniques, namely, logistic regression, decision trees, random forest, and extreme gradient boost, that are suitable for handling imbalance classes to maximize precision and simultaneously reduce false positives. First, these classifiers are trained on two original benchmark unbalanced fraud detection datasets, namely, phishing website URLs and fraudulent credit card transactions. Then, three synthetically balanced datasets are produced for each original data set by implementing the sampling frameworks, namely, RandomUnderSampler, SMOTE, and SMOTEENN. The optimum hyperparameters for all the 16 experiments are revealed using the method RandomzedSearchCV. The validity of the 16 approaches in the context of fraud detection is compared using two benchmark performance metrics, namely, area under the curve of receiver operating characteristics (AUC ROC) and area under the curve of precision and recall (AUC PR). For both phishing website URLs and credit card fraud transaction datasets, the results indicate that extreme gradient boost trained on the original data shows trustworthy performance in the imbalanced dataset and manages to outperform the other three methods in terms of both AUC ROC and AUC PR.


ASTra: A Novel Algorithm-Level Approach to Imbalanced Classification

arXiv.org Artificial Intelligence

This paper addresses the challenge of handling extreme class imbalance, defined here as a situation in which negative examples, conventionally the majority, outnumber positive examples, usually the ones of most interest, by a factor of 500 or more (in other words, have an imbalance ratio (IR) 500). Such problems are not in fact uncommon, and arise in application areas such as fraud detection [1] and cheminformatics [2]. We make use of two methods, that tackle different, but complementary, aspects of the class imbalance problem: ASTra, a novel, adaptive, asymmetric output layer activation function, which makes the correct classification of minority examples easier. A loss function based on an approximated confusion matrix, which aggressively targets the misclassification of minority examples. Our proposed methods have the advantage of being easy to implement and integrate into the workflow of any model that makes binary predictions normally generated by a sigmoid activation (transfer) function. In addition, the paper presents a new means of monitoring training and validation performance, especially valuable in cases of high class imbalance, that could potentially be used with any training regime, independently of the proposed methods.


Data Provenance via Differential Auditing

arXiv.org Artificial Intelligence

Auditing Data Provenance (ADP), i.e., auditing if a certain piece of data has been used to train a machine learning model, is an important problem in data provenance. The feasibility of the task has been demonstrated by existing auditing techniques, e.g., shadow auditing methods, under certain conditions such as the availability of label information and the knowledge of training protocols for the target model. Unfortunately, both of these conditions are often unavailable in real applications. In this paper, we introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance with a different approach based on statistically significant differentials, i.e., after carefully designed transformation, perturbed input data from the target model's training set would result in much more drastic changes in the output than those from the model's non-training set. This framework allows auditors to distinguish training data from non-training ones without the need of training any shadow models with the help of labeled output data. Furthermore, we propose two effective auditing function implementations, an additive one and a multiplicative one. We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.


When is Early Classification of Time Series Meaningful?

arXiv.org Artificial Intelligence

Since its introduction two decades ago, there has been increasing interest in the problem of early classification of time series. This problem generalizes classic time series classification to ask if we can classify a time series subsequence with sufficient accuracy and confidence after seeing only some prefix of a target pattern. The idea is that the earlier classification would allow us to take immediate action, in a domain in which some practical interventions are possible. For example, that intervention might be sounding an alarm or applying the brakes in an automobile. In this work, we make a surprising claim. In spite of the fact that there are dozens of papers on early classification of time series, it is not clear that any of them could ever work in a real-world setting. The problem is not with the algorithms per se but with the vague and underspecified problem description. Essentially all algorithms make implicit and unwarranted assumptions about the problem that will ensure that they will be plagued by false positives and false negatives even if their results suggested that they could obtain near-perfect results. We will explain our findings with novel insights and experiments and offer recommendations to the community.


Analysis of Digitalized ECG Signals Based on Artificial Intelligence and Spectral Analysis Methods Specialized in ARVC

arXiv.org Artificial Intelligence

Arrhythmogenic right ventricular cardiomyopathy (ARVC) is an inherited heart muscle disease that appears between the second and forth decade of a patient's life, being responsible for 20% of sudden cardiac deaths before the age of 35. The effective and punctual diagnosis of this disease based on Electrocardiograms (ECGs) could have a vital role in reducing premature cardiovascular mortality. In our analysis, we firstly outline the digitalization process of paper - based ECG signals enhanced by a spatial filter aiming to eliminate dark regions in the dataset's images that do not correspond to ECG waveform, producing undesirable noise. Next, we propose the utilization of a low - complexity convolutional neural network for the detection of an arrhythmogenic heart disease, that has not been studied through the usage of deep learning methodology to date, achieving high classification accuracy, namely 99.98% training and 98.6% testing accuracy, on a disease the major identification criterion of which are infinitesimal millivolt variations in the ECG's morphology, in contrast with other arrhythmogenic abnormalities. Finally, by performing spectral analysis we investigate significant differentiations in the field of frequencies between normal ECGs and ECGs corresponding to patients suffering from ARVC. In 16 out of the 18 frequencies where we encounter statistically significant differentiations, the normal ECGs are characterized by greater normalized amplitudes compared to the abnormal ones. The overall research carried out in this article highlights the importance of integrating mathematical methods into the examination and effective diagnosis of various diseases, aiming to a substantial contribution to their successful treatment. KEY WORDS: Arrhythmogenic right ventricular cardiomyopathy, Arrhythmia diagnosis, ECG, Signal digitalization, Convolutional neural networks, Arrhythmia detection, Spectral analysis 1. Introduction Arrhythmogenic right ventricular cardiomyopathy (ARVC), is an inherited heart muscle disease characterized by fibro-fatty replacement of the right ventricular myocardium that predisposes patients to arrhythmia and right ventricular (RV) dysfunction leading in some cases to sudden cardiac death (SCD) 2


Machine learning for dynamically predicting the onset of renal replacement therapy in chronic kidney disease patients using claims data

arXiv.org Artificial Intelligence

Chronic kidney disease (CKD) represents a slowly progressive disorder that can eventually require renal replacement therapy (RRT) including dialysis or renal transplantation. Early identification of patients who will require RRT (as much as 1 year in advance) improves patient outcomes, for example by allowing higher-quality vascular access for dialysis. Therefore, early recognition of the need for RRT by care teams is key to successfully managing the disease. Unfortunately, there is currently no commonly used predictive tool for RRT initiation. In this work, we present a machine learning model that dynamically identifies CKD patients at risk of requiring RRT up to one year in advance using only claims data. To evaluate the model, we studied approximately 3 million Medicare beneficiaries for which we made over 8 million predictions. We showed that the model can identify at risk patients with over 90% sensitivity and specificity. Although additional work is required before this approach is ready for clinical use, this study provides a basis for a screening tool to identify patients at risk within a time window that enables early proactive interventions intended to improve RRT outcomes.