Gradient Descent
Application of Explainable Machine Learning in Detecting and Classifying Ransomware Families Based on API Call Analysis
Mowri, Rawshan Ara, Siddula, Madhuri, Roy, Kaushik
Ransomware has appeared as one of the major global threats in recent days. The alarming increasing rate of ransomware attacks and new ransomware variants intrigue the researchers to constantly examine the distinguishing traits of ransomware and refine their detection strategies. Application Programming Interface (API) is a way for one program to collaborate with another; API calls are the medium by which they communicate. Ransomware uses this strategy to interact with the OS and makes a significantly higher number of calls in different sequences to ask for taking action. This research work utilizes the frequencies of different API calls to detect and classify ransomware families. First, a Web-Crawler is developed to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families. By extracting different frequencies of 68 API calls, we develop our dataset in the first phase of the two-phase feature engineering process. After selecting the most significant features in the second phase of the feature engineering process, we deploy six Supervised Machine Learning models: Na"ive Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. Then, the performances of all the classifiers are compared to select the best model. The results reveal that Logistic Regression can efficiently classify ransomware into their corresponding families securing 99.15% overall accuracy. Finally, instead of relying on the 'Black box' characteristic of the Machine Learning models, we present the post-hoc analysis of our best-performing model using 'SHapley Additive exPlanations' or SHAP values to ascertain the transparency and trustworthiness of the model's prediction.
Stochastic Saddle Point Problems with Decision-Dependent Distributions
Wood, Killian, Dall'Anese, Emiliano
This paper focuses on stochastic saddle point problems with decision-dependent distributions. These are problems whose objective is the expected value of a stochastic payoff function and whose data distribution drifts in response to decision variables--a phenomenon represented by a distributional map. A common approach to accommodating distributional shift is to retrain optimal decisions once a new distribution is revealed, or repeated retraining. We introduce the notion of equilibrium points, which are the fixed points of this repeated retraining procedure, and provide sufficient conditions for their existence and uniqueness. To find equilibrium points, we develop deterministic and stochastic primal-dual algorithms and demonstrate their convergence with constant step-size in the former and polynomial decay step-size schedule in the latter. By modeling errors emerging from a stochastic gradient estimator as sub-Weibull random variables, we provide error bounds in expectation and in high probability that hold for each iteration. Without additional knowledge of the distributional map, computing saddle points is intractable. Thus we propose a condition on the distributional map--which we call opposing mixture dominance--that ensures that the objective is strongly-convex-strongly-concave. Finally, we demonstrate that derivative-free algorithms with a single function evaluation are capable of approximating saddle points
Screening for Sparse Online Learning
Sparsity promoting regularizers are widely used to impose low-complexity structure (e.g. l1-norm for sparsity) to the regression coefficients of supervised learning. In the realm of deterministic optimization, the sequence generated by iterative algorithms (such as proximal gradient descent) exhibit "finite activity identification", namely, they can identify the low-complexity structure in a finite number of iterations. However, most online algorithms (such as proximal stochastic gradient descent) do not have the property owing to the vanishing step-size and non-vanishing variance. In this paper, by combining with a screening rule, we show how to eliminate useless features of the iterates generated by online algorithms, and thereby enforce finite activity identification. One consequence is that when combined with any convergent online algorithm, sparsity properties imposed by the regularizer can be exploited for computational gains. Numerically, significant acceleration can be obtained.
Sketched Gaussian Model Linear Discriminant Analysis via the Randomized Kaczmarz Method
Chi, Jocelyn T., Needell, Deanna
We harness a least squares formulation and mobilize the stochastic gradient descent framework. Therefore, we obtain a randomized classifier with performance that is very comparable to that of full data LDA while requiring access to only one row of the training data at a time. We present convergence guarantees for the sketched predictions on new data within a fixed number of iterations. These guarantees account for both the Gaussian modeling assumptions on the data and algorithmic randomness from the sketching procedure. Finally, we demonstrate performance with varying step-sizes and numbers of iterations. Our numerical experiments demonstrate that sketched LDA can offer a very viable alternative to full data LDA when the data may be too large for full data analysis.
Finding Second-Order Stationary Points in Nonconvex-Strongly-Concave Minimax Optimization
Luo, Luo, Li, Yujun, Chen, Cheng
We study the smooth minimax optimization problem $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is $\ell$-smooth, strongly-concave in ${\bf y}$ but possibly nonconvex in ${\bf x}$. Most of existing works focus on finding the first-order stationary points of the function $f({\bf x},{\bf y})$ or its primal function $P({\bf x})\triangleq \max_{\bf y} f({\bf x},{\bf y})$, but few of them focus on achieving second-order stationary points. In this paper, we propose a novel approach for minimax optimization, called Minimax Cubic Newton (MCN), which could find an $\big(\varepsilon,\kappa^{1.5}\sqrt{\rho\varepsilon}\,\big)$-second-order stationary point of $P({\bf x})$ with calling ${\mathcal O}\big(\kappa^{1.5}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of second-order oracles and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of first-order oracles, where $\kappa$ is the condition number and $\rho$ is the Lipschitz continuous constant for the Hessian of $f({\bf x},{\bf y})$. In addition, we propose an inexact variant of MCN for high-dimensional problems to avoid calling expensive second-order oracles. Instead, our method solves the cubic sub-problem inexactly via gradient descent and matrix Chebyshev expansion. This strategy still obtains the desired approximate second-order stationary point with high probability but only requires $\tilde{\mathcal O}\big(\kappa^{1.5}\ell\varepsilon^{-2}\big)$ Hessian-vector oracle calls and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ first-order oracle calls. To the best of our knowledge, this is the first work that considers the non-asymptotic convergence behavior of finding second-order stationary points for minimax problems without the convex-concave assumptions.
Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches
SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives like Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays in waiting for the slowest learners(stragglers) remain to be a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD) which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on finished mini-batches. The Non-blocking idea can be implemented using decentralized algorithms including Ring All-reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. Run-time analysis with random straggler delays and computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up the training and fastens the convergence. Compared with the state-of-the-art decentralized asynchronous algorithms like D-PSGD and MACHA, Non-blocking SGD takes up to 2x fewer time to reach the same training loss in a heterogeneous environment.
Zero-Label Prompt Selection
Liao, Chonghua, Zheng, Yanan, Yang, Zhilin
Natural language prompts have been shown to facilitate cross-task generalization for large language models. However, with no or limited labeled examples, the cross-task performance is highly sensitive to the choice of prompts, while selecting a high-performing prompt is challenging given the scarcity of labels. To address the issue, we propose a Zero-Label Prompt Selection (ZPS) method that selects prompts without any labeled data or gradient update. Specifically, given the candidate human-written prompts for a task, ZPS labels a set of unlabeled data with a prompt ensemble and uses the pseudo-labels for prompt selection. Experiments show that ZPS improves over prior methods by a sizeable margin in zero-label performance. We also extend ZPS to a few-shot setting and show its advantages over strong baselines such as prompt tuning and model tuning. Recently, extensive studies have shown that large language models (LLMs) have promising performance for few-shot learning (Brown et al., 2020; Zhao et al., 2021; Schick & Schütze, 2021; Gao et al., 2021), and they even show strong generalization abilities to new tasks without any annotated data (Brown et al., 2020; Wei et al., 2021; Sanh et al., 2021). Different from conventional fine-tuning methods that require expensive parameter updates for each downstream task, prompts are employed to provide in-context information or task instructions, which is helpful for guiding models to perform each task. Manually-written prompts are often used to specify the task and unify the format of inputs. However, the performance of different prompts during evaluation can vary from near state-of-the-art to random guess; e.g., using a non-optimal prompt can cause a performance drop of up to 60 points on the CB task (Zhao et al., 2021).
Stochastic optimization on matrices and a graphon McKean-Vlasov limit
Harchaoui, Zaid, Oh, Sewoong, Pal, Soumik, Somani, Raghav, Tripathi, Raghavendra
We consider stochastic gradient descents on the space of large symmetric matrices of suitable functions that are invariant under permuting the rows and columns using the same permutation. We establish deterministic limits of these random curves as the dimensions of the matrices go to infinity while the entries remain bounded. Under a "small noise" assumption the limit is shown to be the gradient flow of functions on graphons whose existence was established in [OPST21]. We also consider limits of stochastic gradient descents with added properly scaled reflected Brownian noise. The limiting curve of graphons is characterized by a family of stochastic differential equations with reflections and can be thought of as an extension of the classical McKean-Vlasov limit for interacting diffusions to the graphon setting. The proofs introduce a family of infinite-dimensional exchangeable arrays of reflected diffusions and a novel notion of propagation of chaos for large matrices of diffusions converging to such arrays in a suitable sense.
Gradient-Based Constrained Sampling from Language Models
Kumar, Sachin, Paria, Biswajit, Tsvetkov, Yulia
Large pretrained language models generate fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such language models: generating text that satisfies user-defined constraints, while maintaining fluency and the model's performance in a downstream task. We propose MuCoLa -- a sampling procedure that combines the log-likelihood of the language model with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of the energy function. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.
Black Box Lie Group Preconditioners for SGD
A matrix free and a low rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or finite differences of parameters and gradients similar to the BFGS algorithm. Both preconditioners are fitted with an online updating manner minimizing a criterion that is free of line search and robust to stochastic gradient noise, and further constrained to be on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property saves any need of damping, which is common in second-order optimizers, but difficult to tune. The learning rate for parameter updating and step size for preconditioner fitting are naturally normalized, and their default values work well in most situations.