Regression
Online Orthogonal Matching Pursuit
Saad, El Mehdi, Blanchard, Gilles, Arlot, Sylvain
Greedy algorithms for feature selection are widely used for recovering sparse high-dimensional vectors in linear models. In classical procedures, the main emphasis was put on the sample complexity, with little or no consideration of the computation resources required. We present a novel online algorithm: Online Orthogonal Matching Pursuit (OOMP) for online support recovery in the random design setting of sparse linear regression. Our procedure selects features sequentially, alternating between allocation of samples only as needed to candidate features, and optimization over the selected set of variables to estimate the regression coefficients. Theoretical guarantees about the output of this algorithm are proven and its computational complexity is analysed.
Robust Gaussian Process Regression Based on Iterative Trimming
Li, Zhao-Zhou, Li, Lu, Shao, Zhengyi
The model prediction of the Gaussian process (GP) regression can be significantly biased when the data are contaminated by outliers. We propose a new robust GP regression algorithm that iteratively trims a portion of the data points with the largest deviation from the predicted mean. While the new algorithm retains the attractive properties of the standard GP as a nonparametric and flexible regression method, it can significantly reduce the influence of outliers even in some extreme cases. It is also easier to implement than previous robust GP variants that rely on approximate inference. Applied to various synthetic datasets with contaminations, the proposed method outperforms the standard GP and the popular robust GP variant with the Student's t likelihood, especially when the outlier fraction is high. Lastly, as a practical example in the astrophysical study, we show that this method can determine the main-sequence ridge line precisely in the color-magnitude diagram of star clusters.
The Confusion Behind Logistic Regression
Logistic Regression is the most confusing unsupervised Machine learning Algorithm. As it is a Classification type Algorithm, the word Regression in its naming often tends to confuse the newbies in the Data Science world. We know that Linear Regression is used to predict the output, which consists of continuous values. However, imagine a situation if we want to classify our continuous outputs into various classes. E.g., if we are data of marks of students and based on the percentage output, we want to classify whether it's pass or fail.
Mistakes in Applying Univariate Feature Selection Methods
In Python scikit-learn library, there are various univariate feature selection methods such as Regression F-score, ANOVA and Chi-squared. Perhaps due to the ease of applying these methods (sometimes with just a single line of code), it might be tempting to just use these methods without taking into consideration the type of features you have. I have seen some machine learning practitioners took this for granted and made this mistake (including myself). While the scikit-learn documentation is clear on which feature selection method should be used for regression and classification, it does not specify whether these methods are suitable to apply to both continuous and categorical features. Let's say you have a classification task and after reading the documentation, you know you should use either Chi-squared test or ANOVA.
On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization
Hashemi, Abolfazl, Acharya, Anish, Das, Rudrajit, Vikalo, Haris, Sanghavi, Sujay, Dhillon, Inderjit
In decentralized optimization, it is common algorithmic practice to have nodes interleave (local) gradient descent iterations with gossip (i.e. averaging over the network) steps. Motivated by the training of large-scale machine learning models, it is also increasingly common to require that messages be {\em lossy compressed} versions of the local parameters. In this paper, we show that, in such compressed decentralized optimization settings, there are benefits to having {\em multiple} gossip steps between subsequent gradient iterations, even when the cost of doing so is appropriately accounted for e.g. by means of reducing the precision of compressed information. In particular, we show that having $O(\log\frac{1}{\epsilon})$ gradient iterations {with constant step size} - and $O(\log\frac{1}{\epsilon})$ gossip steps between every pair of these iterations - enables convergence to within $\epsilon$ of the optimal value for smooth non-convex objectives satisfying Polyak-\L{}ojasiewicz condition. This result also holds for smooth strongly convex objectives. To our knowledge, this is the first work that derives convergence results for nonconvex optimization under arbitrary communication compression.
Benign Overfitting in Binary Classification of Gaussian Mixtures
Wang, Ke, Thrampoulidis, Christos
Deep neural networks generalize well despite being exceedingly overparametrized, but understanding the statistical principles behind this so called benign-overfitting phenomenon is not yet well understood. Recently there has been remarkable progress towards understanding benign-overfitting in simpler models, such as linear regression and, even more recently, linear classification. This paper studies benign-overfitting for data generated from a popular binary Gaussian mixtures model (GMM) and classifiers trained by support-vector machines (SVM). Our approach has two steps. First, we leverage an idea introduced in (Muthukumar et al. 2020) to relate the SVM solution to the least-squares (LS) solution. Second, we derive novel non-asymptotic bounds on the classification error of LS solution. Combining the two gives sufficient conditions on the overparameterization ratio and the signal-to-noise ratio that lead to benign overfitting. We corroborate our theoretical findings with numerical simulations.
Logistic Regression with PyTorch
We learned about linear regression in the last post, and now we move to logistic regression. As we cannot use linear regression for a classification task, we use logistic, which is an extension to linear regression for a classification task. Logistic regression is a good approach to problems that require probability as output. Suppose we create a model to predict a striker's probability when he is playing an away match. If a model predicts a p(score Away_Match) of 0.05, then overall away matches, the striker will score approximately 1 goal.
How do you know what independent variables to include in a regression model?
After a few biostatistics classes, I began fitting my first logistic regression model using my physician friend's data on tumors excised from skin cancer patients. I realized that although we were very clear about the dependent variable we were trying to predict – a certain feature of the tumor – I really did not know how to pick the independent variables that belonged in the model. We only had a few to choose from in our dataset, and I put them all into the model, but I wasn't really sure what to do next. Remove the ones that had a slope with a p 0.05? I asked one of my professors what to do, and in her own idiosyncratic way, she seemed to describe what I will call "stepwise selection".
Machine-Learning Number Fields
He, Yang-Hui, Lee, Kyu-Hwan, Oliver, Thomas
We show that standard machine-learning algorithms may be trained to predict certain invariants of algebraic number fields to high accuracy. A random-forest classifier that is trained on finitely many Dedekind zeta coefficients is able to distinguish between real quadratic fields with class number 1 and 2, to 0.96 precision. Furthermore, the classifier is able to extrapolate to fields with discriminant outside the range of the training data. When trained on the coefficients of defining polynomials for Galois extensions of degrees 2, 6, and 8, a logistic regression classifier can distinguish between Galois groups and predict the ranks of unit groups with precision >0.97.
Data Driven Reaction Mechanism Estimation via Transient Kinetics and Machine Learning
Kunz, M. Ross, Yonge, Adam, Fang, Zongtang, Medford, Andrew J., Constales, Denis, Yablonsky, Gregory, Fushimi, Rebecca
Understanding the set of elementary steps and kinetics in each reaction is extremely valuable to make informed decisions about creating the next generation of catalytic materials. With physical and mechanistic complexity of industrial catalysts, it is critical to obtain kinetic information through experimental methods. As such, this work details a methodology based on the combination of transient rate/concentration dependencies and machine learning to measure the number of active sites, the individual rate constants, and gain insight into the mechanism under a complex set of elementary steps. This new methodology was applied to simulated transient responses to verify its ability to obtain correct estimates of the micro-kinetic coefficients. Furthermore, experimental CO oxidation data was analyzed to reveal the Langmuir-Hinshelwood mechanism driving the reaction. As oxygen accumulated on the catalyst, a transition in the mechanism was clearly defined in the machine learning analysis due to the large amount of kinetic information available from transient reaction techniques. This methodology is proposed as a new data driven approach to characterize how materials control complex reaction mechanisms relying exclusively on experimental data.