Regression
Homotopy Parametric Simplex Method for Sparse Learning
Pang, Haotian, Vanderbei, Robert, Liu, Han, Zhao, Tuo
High dimensional sparse learning has imposed a great computational challenge to large scale data analysis. In this paper, we are interested in a broad class of sparse learning approaches formulated as linear programs parametrized by a {\em regularization factor}, and solve them by the parametric simplex method (PSM). Our parametric simplex method offers significant advantages over other competing methods: (1) PSM naturally obtains the complete solution path for all values of the regularization parameter; (2) PSM provides a high precision dual certificate stopping criterion; (3) PSM yields sparse solutions through very few iterations, and the solution sparsity significantly reduces the computational cost per iteration. Particularly, we demonstrate the superiority of PSM over various sparse learning approaches, including Dantzig selector for sparse linear regression, LAD-Lasso for sparse robust linear regression, CLIME for sparse precision matrix estimation, sparse differential network estimation, and sparse Linear Programming Discriminant (LPD) analysis. We then provide sufficient conditions under which PSM always outputs sparse solutions such that its computational performance can be significantly boosted. Thorough numerical experiments are provided to demonstrate the outstanding performance of the PSM method.
The 5 Phases of Every Machine Learning Project โ Blog
Machine learning and predictive analytics are pervasive in our lives today. AI impacts nearly everything we do and interact with including retail and wholesale pricing, consumer habits and behaviors, marketing and advertising, politics, entertainment, sports, medicine, business logistics and planning, fraud and risk detection, airline and truck route planning, pricing strategy, gaming, AI speech recognition, AI image recognition, self-driving cars, and robotics. Yet whether you are creating a self-driving car, predicting customer churn, or cresting a product recommendation system, all machine learning projects follow the same process and the same five basic phases. Data is the new oil. It is quickly becoming the most valuable commodity in the world. Data is like oil because it fuels machine learning projects. Without data, there is no machine learning and no predictive analytics. And just like grades of oil, there are grades of data. Supreme data is like rocket fuel for machine learning models, and buyers pay a premium for it.
Getting started with Machine Learning in MS Excel using XLMiner
Machine Learning is nothing but building a'machine' which'learns' from its experience. And, becomes better with experience โ just like humans. We also learn from our experiences. Companies like Google, Facebook, Microsoft are using machine learning techniques at a larger scale. However, one common mis-conception people have is that they need to learn coding to start machine learning.
Computing the quality of the Laplace approximation
Bayesian inference requires approximation methods to become computable, but for most of them it is impossible to quantify how close the approximation is to the true posterior. In this work, we present a theorem upper-bounding the KL divergence between a log-concave target density $f\left(\boldsymbol{\theta}\right)$ and its Laplace approximation $g\left(\boldsymbol{\theta}\right)$. The bound we present is computable: on the classical logistic regression model, we find our bound to be almost exact as long as the dimensionality of the parameter space is high. The approach we followed in this work can be extended to other Gaussian approximations, as we will do in an extended version of this work, to be submitted to the Annals of Statistics. It will then become a critical tool for characterizing whether, for a given problem, a given Gaussian approximation is suitable, or whether a more precise alternative method should be used instead.
Global optimization for low-dimensional switching linear regression and bounded-error estimation
The paper provides global optimization algorithms for two particularly difficult nonconvex problems raised by hybrid system identification: switching linear regression and bounded-error estimation. While most works focus on local optimization heuristics without global optimality guarantees or with guarantees valid only under restrictive conditions, the proposed approach always yields a solution with a certificate of global optimality. This approach relies on a branch-and-bound strategy for which we devise lower bounds that can be efficiently computed. In order to obtain scalable algorithms with respect to the number of data, we directly optimize the model parameters in a continuous optimization setting without involving integer variables. Numerical experiments show that the proposed algorithms offer a higher accuracy than convex relaxations with a reasonable computational burden for hybrid system identification. In addition, we discuss how bounded-error estimation is related to robust estimation in the presence of outliers and exact recovery under sparse noise, for which we also obtain promising numerical results.
Causal nearest neighbor rules for optimal treatment regimes
Zhou, Xin, Kosorok, Michael R.
The estimation of optimal treatment regimes is of considerable interest to precision medicine. In this work, we propose a causal $k$-nearest neighbor method to estimate the optimal treatment regime. The method roots in the framework of causal inference, and estimates the causal treatment effects within the nearest neighborhood. Although the method is simple, it possesses nice theoretical properties. We show that the causal $k$-nearest neighbor regime is universally consistent. That is, the causal $k$-nearest neighbor regime will eventually learn the optimal treatment regime as the sample size increases. We also establish its convergence rate. However, the causal $k$-nearest neighbor regime may suffer from the curse of dimensionality, i.e. performance deteriorates as dimensionality increases. To alleviate this problem, we develop an adaptive causal $k$-nearest neighbor method to perform metric selection and variable selection simultaneously. The performance of the proposed methods is illustrated in simulation studies and in an analysis of a chronic depression clinical trial.
Streaming Weak Submodularity: Interpreting Neural Networks on the Fly
Elenberg, Ethan R., Dimakis, Alexandros G., Feldman, Moran, Karbasi, Amin
In many machine learning applications, it is important to explain the predictions of a black-box classifier. For example, why does a deep neural network assign an image to a particular class? We cast interpretability of black-box classifiers as a combinatorial maximization problem and propose an efficient streaming algorithm to solve it subject to cardinality constraints. By extending ideas from Badanidiyuru et al. [2014], we provide a constant factor approximation guarantee for our algorithm in the case of random stream order and a weakly submodular objective function. This is the first such theoretical guarantee for this general class of functions, and we also show that no such algorithm exists for a worst case stream order. Our algorithm obtains similar explanations of Inception V3 predictions $10$ times faster than the state-of-the-art LIME framework of Ribeiro et al. [2016].
On Faster Convergence of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization
Li, Xingguo, Zhao, Tuo, Arora, Raman, Liu, Han, Hong, Mingyi
The cyclic block coordinate descent-type (CBCD-type) methods, which performs iterative updates for a few coordinates (a block) simultaneously throughout the procedure, have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that for strongly convex minimization, the CBCD-type methods attain iteration complexity of $\mathcal{O}(p\log(1/\epsilon))$, where $\epsilon$ is a pre-specified accuracy of the objective value, and $p$ is the number of blocks. However, such iteration complexity explicitly depends on $p$, and therefore is at least $p$ times worse than the complexity $\mathcal{O}(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity $\mathcal{O}(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCD-type methods matches that of the GD methods in term of dependency on $p$, up to a $\log^2 p$ factor. Thus our complexity bounds are sharper than the existing bounds by at least a factor of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a $\log^2 (p)$ factor), under the assumption that the largest and smallest eigenvalues of the Hessian matrix do not scale with $p$. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones.
Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems
In this paper, we consider a class of possibly nonconvex, nonsmooth and non-Lipschitz optimization problems arising in many contemporary applications such as machine learning, variable selection and image processing. To solve this class of problems, we propose a proximal gradient method with extrapolation and line search (PGels). This method is developed based on a special potential function and successfully incorporates both extrapolation and non-monotone line search, which are two simple and efficient accelerating techniques for the proximal gradient method. Thanks to the line search, this method allows more flexibilities in choosing the extrapolation parameters and updates them adaptively at each iteration if a certain line search criterion is not satisfied. Moreover, with proper choices of parameters, our PGels reduces to many existing algorithms. We also show that, under some mild conditions, our line search criterion is well defined and any cluster point of the sequence generated by PGels is a stationary point of our problem. In addition, by assuming the Kurdyka-${\L}$ojasiewicz exponent of the objective in our problem, we further analyze the local convergence rate of two special cases of PGels, including the widely used non-monotone proximal gradient method as one case. Finally, we conduct some numerical experiments for solving the $\ell_1$ regularized logistic regression problem and the $\ell_{1\text{-}2}$ regularized least squares problem. Our numerical results illustrate the efficiency of PGels and show the potential advantage of combining two accelerating techniques.
Crash Course in Machine Learning โ IoT For All โ Medium
When you type'machine learning' into Google News, the first link you see is a Forbes Magazine piece called "What's The Difference Between Machine Learning And Artificial Intelligence?" This article contained so many flowery, grandiose descriptions about ML and AI technology that I couldn't help but laugh. With all the nonsense used to describe machine learning (ML) and artificial intelligence (AI), it's time we do a deep dive into what these technologies actually do. First, we need to learn the difference between AI and ML. Fortunately, a fellow writer has already written an excellent explanation here.