Performance Analysis
How to train your draGAN: A task oriented solution to imbalanced classification
Guertler, Leon O., Ashfahani, Andri, Luu, Anh Tuan
The long-standing challenge of building effective classification models for small and imbalanced datasets has seen little improvement since the creation of the Synthetic Minority Over-sampling Technique (SMOTE) over 20 years ago. Though GAN based models seem promising, there has been a lack of purpose built architectures for solving the aforementioned problem, as most previous studies focus on applying already existing models. This paper proposes a unique, performance-oriented, data-generating strategy that utilizes a new architecture, coined draGAN, to generate both minority and majority samples. The samples are generated with the objective of optimizing the classification model's performance, rather than similarity to the real data. We benchmark our approach against state-of-the-art methods from the SMOTE family and competitive GAN based approaches on 94 tabular datasets with varying degrees of imbalance and linearity. Empirically we show the superiority of draGAN, but also highlight some of its shortcomings. All code is available on: https://github.com/LeonGuertler/draGAN.
ClaSP -- Parameter-free Time Series Segmentation
Ermshaus, Arik, Schäfer, Patrick, Leser, Ulf
The study of natural and human-made processes often results in long sequences of temporally-ordered values, aka time series (TS). Such processes often consist of multiple states, e.g. operating modes of a machine, such that state changes in the observed processes result in changes in the distribution of shape of the measured values. Time series segmentation (TSS) tries to find such changes in TS post-hoc to deduce changes in the data-generating process. TSS is typically approached as an unsupervised learning problem aiming at the identification of segments distinguishable by some statistical property. Current algorithms for TSS require domain-dependent hyper-parameters to be set by the user, make assumptions about the TS value distribution or the types of detectable changes which limits their applicability. Common hyperparameters are the measure of segment homogeneity and the number of change points, which are particularly hard to tune for each data set. We present ClaSP, a novel, highly accurate, hyper-parameter-free and domain-agnostic method for TSS. ClaSP hierarchically splits a TS into two parts. A change point is determined by training a binary TS classifier for each possible split point and selecting the one split that is best at identifying subsequences to be from either of the partitions. ClaSP learns its main two model-parameters from the data using two novel bespoke algorithms. In our experimental evaluation using a benchmark of 107 data sets, we show that ClaSP outperforms the state of the art in terms of accuracy and is fast and scalable. Furthermore, we highlight properties of ClaSP using several real-world case studies.
A New Hip Fracture Risk Index Derived from FEA-Computed Proximal Femur Fracture Loads and Energies-to-Failure
Cao, Xuewei, Keyak, Joyce H, Sigurdsson, Sigurdur, Zhao, Chen, Zhou, Weihua, Liu, Anqi, Lang, Thomas, Deng, Hong-Wen, Gudnason, Vilmundur, Sha, Qiuying
Hip fracture risk assessment is an important but challenging task. Quantitative CT-based patient specific finite element analysis (FEA) computes the force (fracture load) to break the proximal femur in a particular loading condition. It provides different structural information about the proximal femur that can influence a subject overall fracture risk. To obtain a more robust measure of fracture risk, we used principal component analysis (PCA) to develop a global FEA computed fracture risk index that incorporates the FEA-computed yield and ultimate failure loads and energies to failure in four loading conditions (single-limb stance and impact from a fall onto the posterior, posterolateral, and lateral aspects of the greater trochanter) of 110 hip fracture subjects and 235 age and sex matched control subjects from the AGES-Reykjavik study. We found that the first PC (PC1) of the FE parameters was the only significant predictor of hip fracture. Using a logistic regression model, we determined if prediction performance for hip fracture using PC1 differed from that using FE parameters combined by stratified random resampling with respect to hip fracture status. The results showed that the average of the area under the receive operating characteristic curve (AUC) using PC1 was always higher than that using all FE parameters combined in the male subjects. The AUC of PC1 and AUC of the FE parameters combined were not significantly different than that in the female subjects or in all subjects
DeepHider: A Covert NLP Watermarking Framework Based on Multi-task Learning
Dai, Long, Mao, Jiarong, Fan, Xuefeng, Zhou, Xiaoyi
Natural language processing (NLP) technology has shown great commercial value in applications such as sentiment analysis. But NLP models are vulnerable to the threat of pirated redistribution, damaging the economic interests of model owners. Digital watermarking technology is an effective means to protect the intellectual property rights of NLP model. The existing NLP model protection mainly designs watermarking schemes by improving both security and robustness purposes, however, the security and robustness of these schemes have the following problems, respectively: (1) Watermarks are difficult to defend against fraudulent declaration by adversary and are easily detected and blocked from verification by human or anomaly detector during the verification process. (2) The watermarking model cannot meet multiple robustness requirements at the same time. To solve the above problems, this paper proposes a novel watermarking framework for NLP model based on the over-parameterization of depth model and the multi-task learning theory. Specifically, a covert trigger set is established to realize the perception-free verification of the watermarking model, and a novel auxiliary network is designed to improve the robustness and security of the watermarking model. The proposed framework was evaluated on two benchmark datasets and three mainstream NLP models, and the results show that the framework can successfully validate model ownership with 100% validation accuracy and advanced robustness and security without compromising the host model performance.
Clustering based opcode graph generation for malware variant detection
Fok, Kar Wai, Thing, Vrizlynn L. L.
Malwares are the key means leveraged by threat actors in the cyber space for their attacks. There is a large array of commercial solutions in the market and significant scientific research to tackle the challenge of the detection and defense against malwares. At the same time, attackers also advance their capabilities in creating polymorphic and metamorphic malwares to make it increasingly challenging for existing solutions. To tackle this issue, we propose a methodology to perform malware detection and family attribution. The proposed methodology first performs the extraction of opcodes from malwares in each family and constructs their respective opcode graphs. We explore the use of clustering algorithms on the opcode graphs to detect clusters of malwares within the same malware family. Such clusters can be seen as belonging to different sub-family groups. Opcode graph signatures are built from each detected cluster. Hence, for each malware family, a group of signatures is generated to represent the family. These signatures are used to classify an unknown sample as benign or belonging to one the malware families. We evaluate our methodology by performing experiments on a dataset consisting of both benign files and malware samples belonging to a number of different malware families and comparing the results to existing approach.
Leveraging Algorithmic Fairness to Mitigate Blackbox Attribute Inference Attacks
Aalmoes, Jan, Duddu, Vasisht, Boutet, Antoine
Machine learning (ML) models have been deployed for high-stakes applications, e.g., healthcare and criminal justice. Prior work has shown that ML models are vulnerable to attribute inference attacks where an adversary, with some background knowledge, trains an ML attack model to infer sensitive attributes by exploiting distinguishable model predictions. However, some prior attribute inference attacks have strong assumptions about adversary's background knowledge (e.g., marginal distribution of sensitive attribute) and pose no more privacy risk than statistical inference. Moreover, none of the prior attacks account for class imbalance of sensitive attribute in datasets coming from real-world applications (e.g., Race and Sex). In this paper, we propose an practical and effective attribute inference attack that accounts for this imbalance using an adaptive threshold over the attack model's predictions. We exhaustively evaluate our proposed attack on multiple datasets and show that the adaptive threshold over the model's predictions drastically improves the attack accuracy over prior work. Finally, current literature lacks an effective defence against attribute inference attacks. We investigate the impact of fairness constraints (i.e., designed to mitigate unfairness in model predictions) during model training on our attribute inference attack. We show that constraint based fairness algorithms which enforces equalized odds acts as an effective defense against attribute inference attacks without impacting the model utility. Hence, the objective of algorithmic fairness and sensitive attribute privacy are aligned.
Intrusion Detection in Internet of Things using Convolutional Neural Networks
Kodys, Martin, Lu, Zhi, Fok, Kar Wai, Thing, Vrizlynn L. L.
Internet of Things (IoT) has become a popular paradigm to fulfil needs of the industry such as asset tracking, resource monitoring and automation. As security mechanisms are often neglected during the deployment of IoT devices, they are more easily attacked by complicated and large volume intrusion attacks using advanced techniques. Artificial Intelligence (AI) has been used by the cyber security community in the past decade to automatically identify such attacks. However, deep learning methods have yet to be extensively explored for Intrusion Detection Systems (IDS) specifically for IoT. Most recent works are based on time sequential models like LSTM and there is short of research in CNNs as they are not naturally suited for this problem. In this article, we propose a novel solution to the intrusion attacks against IoT devices using CNNs. The data is encoded as the convolutional operations to capture the patterns from the sensors data along time that are useful for attacks detection by CNNs. The proposed method is integrated with two classical CNNs: ResNet and EfficientNet, where the detection performance is evaluated. The experimental results show significant improvement in both true positive rate and false positive rate compared to the baseline using LSTM.
Machine Learning-Assisted Recurrence Prediction for Early-Stage Non-Small-Cell Lung Cancer Patients
Janik, Adrianna, Torrente, Maria, Costabello, Luca, Calvo, Virginia, Walsh, Brian, Camps, Carlos, Mohamed, Sameh K., Ortega, Ana L., Nováček, Vít, Massutí, Bartomeu, Minervini, Pasquale, Campelo, M. Rosario Garcia, del Barco, Edel, Bosch-Barrera, Joaquim, Menasalvas, Ernestina, Timilsina, Mohan, Provencio, Mariano
Background: Stratifying cancer patients according to risk of relapse can personalize their care. In this work, we provide an answer to the following research question: How to utilize machine learning to estimate probability of relapse in early-stage non-small-cell lung cancer patients? Methods: For predicting relapse in 1,387 early-stage (I-II), non-small-cell lung cancer (NSCLC) patients from the Spanish Lung Cancer Group data (65.7 average age, 24.8% females, 75.2% males) we train tabular and graph machine learning models. We generate automatic explanations for the predictions of such models. For models trained on tabular data, we adopt SHAP local explanations to gauge how each patient feature contributes to the predicted outcome. We explain graph machine learning predictions with an example-based method that highlights influential past patients. Results: Machine learning models trained on tabular data exhibit a 76% accuracy for the Random Forest model at predicting relapse evaluated with a 10-fold cross-validation (model was trained 10 times with different independent sets of patients in test, train and validation sets, the reported metrics are averaged over these 10 test sets). Graph machine learning reaches 68% accuracy over a 200-patient, held-out test set, calibrated on a held-out set of 100 patients. Conclusions: Our results show that machine learning models trained on tabular and graph data can enable objective, personalised and reproducible prediction of relapse and therefore, disease outcome in patients with early-stage NSCLC. With further prospective and multisite validation, and additional radiological and molecular data, this prognostic model could potentially serve as a predictive decision support tool for deciding the use of adjuvant treatments in early-stage lung cancer. Keywords: Non-Small-Cell Lung Cancer, Tumor Recurrence Prediction, Machine Learning
Machine Learning for Software Engineering: A Tertiary Study
Kotti, Zoe, Galanopoulou, Rafaila, Spinellis, Diomidis
Through ML we can address SE problems that cannot be completely algorithmically modeled, or for which existing solutions do not provide satisfactory results yet (e.g., defect/fault detection [16, 165, 180]). In addition, ML finds application in SE tasks where data cannot be easily analyzed with other algorithms (e.g., software requirements, code comments, code reviews, issues [9, 91, 174]). Another important aspect of ML is that it can significantly reduce manual effort in common SE tasks (e.g., automatic program repair [157], code suggestion [61], defect prediction [19], malware detection [147], feature location [40]) with great accuracy results [146, 164]. In fields such as health informatics ML and SE are considered complementary disciplines, since the growing scale and complexity of healthcare datasets have posed a challenge for clinical practice and medical research, requiring new engineering approaches from both fields [38]. In the early nineties, Huff and Selfridge [68] recognized the need for creating software systems that partially take some responsibility for their own evolution, offering the ability to implement, measure, and assess changes easily. These changes should also contribute to the overall improvement of the corresponding systems [142].
Fast Uncertainty Estimates in Deep Learning Interatomic Potentials
Zhu, Albert, Batzner, Simon, Musaelian, Albert, Kozinsky, Boris
Deep learning has emerged as a promising paradigm to give access to highly accurate predictions of molecular and materials properties. A common short-coming shared by current approaches, however, is that neural networks only give point estimates of their predictions and do not come with predictive uncertainties associated with these estimates. Existing uncertainty quantification efforts have primarily leveraged the standard deviation of predictions across an ensemble of independently trained neural networks. This incurs a large computational overhead in both training and prediction that often results in order-of-magnitude more expensive predictions. Here, we propose a method to estimate the predictive uncertainty based on a single neural network without the need for an ensemble. This allows us to obtain uncertainty estimates with virtually no additional computational overhead over standard training and inference. We demonstrate that the quality of the uncertainty estimates matches those obtained from deep ensembles. We further examine the uncertainty estimates of our methods and deep ensembles across the configuration space of our test system and compare the uncertainties to the potential energy surface. Finally, we study the efficacy of the method in an active learning setting and find the results to match an ensemble-based strategy at order-of-magnitude reduced computational cost.