Regression
Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis
Mowri, Rawshan Ara, Siddula, Madhuri, Roy, Kaushik
Ransomware has appeared as one of the major global threats in recent days. The alarming increasing rate of ransomware attacks and new ransomware variants intrigue the researchers to constantly examine the distinguishing traits of ransomware and refine their detection strategies. Application Programming Interface (API) is a way for one program to collaborate with another; API calls are the medium by which they communicate. Ransomware uses this strategy to interact with the OS and makes a significantly higher number of calls in different sequences to ask for taking action. This research work utilizes the frequencies of different API calls to detect and classify ransomware families. First, a Web-Crawler is developed to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families. By extracting different frequencies of 68 API calls, we develop our dataset in the first phase of the two-phase feature engineering process. After selecting the most significant features in the second phase of the feature engineering process, we deploy six Supervised Machine Learning models: Na"ive Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. Then, the performances of all the classifiers are compared to select the best model. The results reveal that Logistic Regression can efficiently classify ransomware into their corresponding families securing 99.15% overall accuracy. Finally, instead of relying on the 'Black box' characteristic of the Machine Learning models, we present the post-hoc analysis of our best-performing model using 'SHapley Additive exPlanations' or SHAP values to ascertain the transparency and trustworthiness of the model's prediction.
A Randomised Subspace Gauss-Newton Method for Nonlinear Least-Squares
Cartis, Coralia, Fowkes, Jaroslav, Shao, Zhen
We propose a Randomised Subspace Gauss-Newton (R-SGN) algorithm for solving nonlinear least-squares optimization problems, that uses a sketched Jacobian of the residual in the variable domain and solves a reduced linear least-squares on each iteration. A sublinear global rate of convergence result is presented for a trust-region variant of R-SGN, with high probability, which matches deterministic counterpart results in the order of the accuracy tolerance. Promising preliminary numerical results are presented for R-SGN on logistic regression and on nonlinear regression problems from the CUTEst collection.
DiSC: Differential Spectral Clustering of Features
Sristi, Ram Dyuthi, Mishne, Gal, Jaffe, Ariel
Selecting subsets of features that differentiate between two conditions is a key task in a broad range of scientific domains. In many applications, the features of interest form clusters with similar effects on the data at hand. To recover such clusters we develop DiSC, a data-driven approach for detecting groups of features that differentiate between conditions. For each condition, we construct a graph whose nodes correspond to the features and whose weights are functions of the similarity between them for that condition. We then apply a spectral approach to compute subsets of nodes whose connectivity differs significantly between the condition-specific feature graphs. On the theoretical front, we analyze our approach with a toy example based on the stochastic block model. We evaluate DiSC on a variety of datasets, including MNIST, hyperspectral imaging, simulated scRNA-seq and task fMRI, and demonstrate that DiSC uncovers features that better differentiate between conditions compared to competing methods.
Innovations in Integrating Machine Learning and Agent-Based Modeling of Biomedical Systems
Sivakumar, Nikita, Mura, Cameron, Peirce, Shayn M.
Agent-based modeling (ABM) is a well-established paradigm for simulating complex systems via interactions between constituent entities. Machine learning (ML) refers to approaches whereby statistical algorithms 'learn' from data on their own, without imposing a priori theories of system behavior. Biological systems -- from molecules, to cells, to entire organisms -- consist of vast numbers of entities, governed by complex webs of interactions that span many spatiotemporal scales and exhibit nonlinearity, stochasticity and intricate coupling between entities. The macroscopic properties and collective dynamics of such systems are difficult to capture via continuum modelling and mean-field formalisms. ABM takes a 'bottom-up' approach that obviates these difficulties by enabling one to easily propose and test a set of well-defined 'rules' to be applied to the individual entities (agents) in a system. Evaluating a system and propagating its state over discrete time-steps effectively simulates the system, allowing observables to be computed and system properties to be analyzed. Because the rules that govern an ABM can be difficult to abstract and formulate from experimental data, there is an opportunity to use ML to help infer optimal, system-specific ABM rules. Once such rule-sets are devised, ABM calculations can generate a wealth of data, and ML can be applied there too -- e.g., to probe statistical measures that meaningfully describe a system's stochastic properties. As an example of synergy in the other direction (from ABM to ML), ABM simulations can generate realistic datasets for training ML algorithms (e.g., for regularization, to mitigate overfitting). In these ways, one can envision various synergistic ABM$\rightleftharpoons$ML loops. This review summarizes how ABM and ML have been integrated in contexts that span spatiotemporal scales, from cellular to population-level epidemiology.
Understanding Benign Overfitting in Gradient-Based Meta Learning
Chen, Lisha, Lu, Songtao, Chen, Tianyi
Meta learning has demonstrated tremendous success in few-shot learning with limited supervised data. In those settings, the meta model is usually overparameterized. While the conventional statistical learning theory suggests that overparameterized models tend to overfit, empirical evidence reveals that overparameterized meta learning methods still work well -- a phenomenon often called "benign overfitting." To understand this phenomenon, we focus on the meta learning settings with a challenging bilevel structure that we term the gradient-based meta learning, and analyze its generalization performance under an overparameterized meta linear regression model. While our analysis uses the relatively tractable linear models, our theory contributes to understanding the delicate interplay among data heterogeneity, model adaptation and benign overfitting in gradient-based meta learning tasks. We corroborate our theoretical claims through numerical simulations.
Deep Learning Prerequisites: Linear Regression in Python
Bestseller Created by Lazy Programmer Inc. This course teaches you about one popular technique used in machine learning, data science and statistics: linear regression. We cover the theory from the ground up: derivation of the solution, and applications to real-world problems. We show you how one might code their own linear regression module in Python.
Minimalist Data Wrangling with Python
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results. This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/.
RIGID: Robust Linear Regression with Missing Data
Aghasi, Alireza, Feizollahi, MohammadJavad, Ghadimi, Saeed
We present a robust framework to perform linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we are able to conditionally formulate a distribution for the missing entries and present a robust framework, which minimizes the worst case error caused by the uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program, for which a customized and scalable solver can be delivered. In addition to a detailed analysis to deliver such solver, we also asymptoticly analyze the behavior of the proposed framework, and present technical discussions to estimate the required input parameters. We complement our analysis with experiments performed on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves the prediction accuracy and robustness, and outperforms the competing techniques. Missing data is a common problem associated with many datasets in machine learning. With the significant increase in using robust optimization techniques to train machine learning models, this paper presents a novel robust regression framework that operates by minimizing the uncertainty associated with missing data. The proposed approach allows training models with incomplete data, while minimizing the impact of uncertainty associated with the unavailable data. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.
Proactive Detractor Detection Framework Based on Message-Wise Sentiment Analysis Over Customer Support Interactions
Gallo, Juan Sebastiรกn Salcedo, Solano, Jesรบs, Garcรญa, Javier Hernรกn, Zarruk-Valencia, David, Correa-Bahnsen, Alejandro
In this work, we propose a framework relying solely on chat-based customer support (CS) interactions for predicting the recommendation decision of individual users. For our case study, we analyzed a total number of 16.4k users and 48.7k customer support conversations within the financial vertical of a large e-commerce company in Latin America. Consequently, our main contributions and objectives are to use Natural Language Processing (NLP) to assess and predict the recommendation behavior where, in addition to using static sentiment analysis, we exploit the predictive power of each user's sentiment dynamics. Our results show that, with respective feature interpretability, it is possible to predict the likelihood of a user to recommend a product or service, based solely on the message-wise sentiment evolution of their CS conversations in a fully automated way.
The Importance of Suppressing Complete Reconstruction in Autoencoders for Unsupervised Outlier Detection
Autoencoders are widely used in outlier detection due to their superiority in handling high-dimensional and nonlinear datasets. The reconstruction of any dataset by the autoencoder can be considered as a complex regression process. In regression analysis, outliers can usually be divided into high leverage points and influential points. Although the autoencoder has shown good results for the identification of influential points, there are still some problems when detect high leverage points. Through theoretical derivation, we found that most outliers are detected in the direction corresponding to the worst-recovered principal component, but in the direction of the well-recovered principal components, the anomalies are often ignored. We propose a new loss function which solve the above deficiencies in outlier detection. The core idea of our scheme is that in order to better detect high leverage points, we should suppress the complete reconstruction of the dataset to convert high leverage points into influential points, and it is also necessary to ensure that the differences between the eigenvalues of the covariance matrix of the original dataset and their corresponding reconstructed results in the direction of each principal component are equal. Besides, we explain the rationality of our scheme through rigorous theoretical derivation. Finally, our experiments on multiple datasets confirm that our scheme significantly improves the accuracy of outlier detection. Outlier detection refers to the process of identifying data points that deviate significantly from normal data point clusters. For multidimensional data points, there are various outliers.