Regression
Fast Generalized Matrix Regression with Applications in Machine Learning
Ye, Haishan, Wang, Shusen, Zhang, Zhihua, Zhang, Tong
Fast matrix algorithms have become fundamental tools of machine learning in the big data era. The generalized matrix regression (GMR) problem is widely used in matrix approximation tasks such as CUR decomposition, kernel matrix approximation, and streaming singular value decomposition (SVD). In this paper, we propose a fast generalized matrix regression algorithm (Fast GMR) that uses sketching techniques to solve the GMR problem efficiently. Given an error parameter $0<\epsilon<1$, the Fast GMR algorithm achieves a $(1+\epsilon)$ relative error with sketching sizes of order $\mathcal{O}(\epsilon^{-1/2})$ for a large group of GMR problems. We apply the Fast GMR algorithm to symmetric positive definite matrix approximation and single-pass singular value decomposition, where it achieves better performance than conventional algorithms. Our empirical study also validates the effectiveness and efficiency of the proposed algorithms.
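As a rough illustration of the sketching idea (not the authors' Fast GMR implementation), the NumPy snippet below solves a small problem of the form $\min_X \|A - CXR\|_F$ once exactly and once after compressing both sides with Gaussian sketches. The problem form, the names `C`, `R`, `S_c`, `S_r`, the matrix sizes and the sketch size `s` are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes, chosen only for this sketch.
m, n, c, r, s = 300, 200, 5, 5, 60

A = rng.standard_normal((m, n))   # matrix to approximate
C = rng.standard_normal((m, c))   # left factor
R = rng.standard_normal((r, n))   # right factor


def gmr_error(X):
    """Frobenius-norm residual ||A - C X R||_F of the GMR objective."""
    return np.linalg.norm(A - C @ X @ R, "fro")


# Exact minimizer via pseudo-inverses: X* = C^+ A R^+.
X_exact = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

# Gaussian sketches that shrink the m and n dimensions down to s.
S_c = rng.standard_normal((s, m)) / np.sqrt(s)
S_r = rng.standard_normal((n, s)) / np.sqrt(s)

# Minimizer of the much smaller problem ||S_c (A - C X R) S_r||_F.
X_sketch = np.linalg.pinv(S_c @ C) @ (S_c @ A @ S_r) @ np.linalg.pinv(R @ S_r)

ratio = gmr_error(X_sketch) / gmr_error(X_exact)
print(f"sketched / exact residual: {ratio:.4f}")
```

The ratio can never drop below 1, because `X_exact` minimizes the unsketched objective; how close it stays to 1 depends on the sketch size.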
OCCER- One-Class Classification by Ensembles of Regression models
Ahmad, Amir, Bezawada, Srikanth
One-class classification (OCC) deals with the classification problem in which the training data contains data points belonging to the target class only. In this paper, we present a one-class classification algorithm, One-Class Classification by Ensembles of Regression models (OCCER), that uses regression methods to address OCC problems. The OCCER algorithm converts an OCC problem into many regression problems in the original feature space, such that each feature of the original feature space is used as the target variable in one of the regression problems. The other features are used as the variables on which the target variable depends. The regression errors of a data point across all the regression models are used to compute the outlier score of that data point. An extensive comparison of OCCER with state-of-the-art OCC algorithms on several datasets was carried out to show the effectiveness of the proposed approach. We also show that the OCCER algorithm works well with the latent feature space created by autoencoders for image datasets. The implementation of OCCER is available at https://github.com/srikanthBezawada/OCCER.
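The feature-wise regression idea can be sketched in a few lines of NumPy. This is not the OCCER implementation from the repository above: it assumes a plain linear least-squares regressor in place of the paper's regression ensembles, and it assumes the outlier score is the mean absolute regression error across features. The function names and the toy data are hypothetical.

```python
import numpy as np


def fit_feature_regressors(X_train):
    """Fit one linear model per feature, using the other features as inputs."""
    n, d = X_train.shape
    models = []
    for j in range(d):
        others = np.delete(X_train, j, axis=1)
        design = np.column_stack([others, np.ones(n)])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, X_train[:, j], rcond=None)
        models.append(coef)
    return models


def outlier_scores(models, X):
    """Mean absolute error over all per-feature regressions (assumed score)."""
    n, d = X.shape
    errors = np.empty((n, d))
    for j, coef in enumerate(models):
        others = np.delete(X, j, axis=1)
        design = np.column_stack([others, np.ones(n)])
        errors[:, j] = np.abs(design @ coef - X[:, j])
    return errors.mean(axis=1)


rng = np.random.default_rng(0)

# Toy target class: points close to the plane x2 = x0 + x1.
base = rng.standard_normal((200, 2))
target = np.column_stack([base, base.sum(axis=1) + 0.05 * rng.standard_normal(200)])
models = fit_feature_regressors(target)

score_in = outlier_scores(models, np.array([[1.0, 2.0, 3.0]]))[0]    # on the plane
score_out = outlier_scores(models, np.array([[1.0, 2.0, -3.0]]))[0]  # far from it
print(score_in, score_out)
```

A point that respects the relationship between the features is predicted well by every regressor and gets a low score, while a point that breaks it is predicted poorly and gets a high one.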
Machine learning and its applications in plant molecular studies
The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies. The advent of high-throughput sequencing technologies has produced several large-scale data sets. This enormous amount of information enables biologists to explore topics that were once difficult or impossible to investigate, such as associations between microRNA and certain diseases, the causes of vascular inflammation and atherosclerosis in humans [1–3] and stress breeding in plants [4]. However, many challenges have also emerged. For example, the European Bioinformatics Institute now stores 273 petabytes of raw molecular data on humans, plants and animals (https://www.ebi.ac.uk/).
An improper estimator with optimal excess risk in misspecified density estimation and logistic regression
Mourtada, Jaouad, Gaïffas, Stéphane
We introduce a procedure for predictive conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This predictor minimizes a new general excess risk bound, which critically remains valid under model misspecification. On standard examples, this bound scales as $d/n$ where $d$ is the dimension of the model and $n$ the sample size, regardless of the true distribution. The SMP, which is an improper (out-of-model) procedure, improves over proper (within-model) estimators (such as the maximum likelihood estimator), whose excess risk can degrade arbitrarily in the misspecified case. For density estimation, our bounds improve over approaches based on online-to-batch conversion, by removing suboptimal $\log n$ factors, addressing an open problem from Grünwald and Kotłowski (2011) for the considered models. For the Gaussian linear model, the SMP admits an explicit expression, and its expected excess risk in the general misspecified case is at most twice the minimax excess risk in the well-specified case, but without any condition on the noise variance or approximation error of the linear model. For logistic regression, a penalized SMP can be computed efficiently by training two logistic regressions, and achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ is a bound on the norm of the features and $B$ the norm of the comparison linear predictor. This improves the rates of proper (within-model) estimators, since such procedures can achieve no better rate than $\min(BR/\sqrt{n},de^{BR}/n)$ in general. This also provides a computationally more efficient alternative to approaches based on online-to-batch conversion of Bayesian mixture procedures, which require approximate posterior sampling, thereby partly answering a question by Foster et al. (2018).
Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices
The first part of this paper is devoted to the decision-theoretic analysis of random-design linear prediction with square loss. It is known that, under boundedness constraints on the response (and thus regression coefficients), the minimax excess risk scales as $C\sigma^2d/n$ up to constants, where $d$ is the model dimension, $n$ the sample size, and $\sigma^2$ the noise parameter. Here, we study the expected excess risk with respect to the full linear class. We show that the ordinary least squares (OLS) estimator is minimax optimal in the well-specified case, for every distribution of covariates and noise level. Further, we express the minimax risk in terms of the distribution of statistical leverage scores of individual samples. We deduce a precise minimax lower bound of $\sigma^2d/(n-d+1)$, valid for any distribution of covariates, which nearly matches the risk of OLS for Gaussian covariates. We then obtain nonasymptotic upper bounds on the minimax risk for covariates that satisfy a "small ball"-type regularity condition, which scale as $(1+o(1))\sigma^2d/n$ as $d=o(n)$, both in the well-specified and misspecified cases. Our main technical contribution is the study of the lower tail of the smallest singular value of empirical covariance matrices around $0$. We establish a general lower bound on this lower tail, together with a matching upper bound under a necessary regularity condition. Our proof relies on the PAC-Bayesian technique for controlling empirical processes, and extends an analysis of Oliveira (2016) devoted to a different part of the lower tail. Equivalently, our upper bound shows that the operator norm of the inverse sample covariance matrix has bounded $L^q$ norm up to $q\asymp n$, and this exponent is unimprovable. Finally, we show that the regularity condition on the design naturally holds for independent coordinates.
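To get a feel for how close these quantities are, the short Python sketch below evaluates the lower bound $\sigma^2d/(n-d+1)$ next to the classical rate $\sigma^2d/n$ and the expected excess risk of OLS under Gaussian covariates. The last formula, $\sigma^2d/(n-d-1)$ for $n>d+1$, is a standard result assumed here; it does not appear in the abstract. The function names and the example values of $\sigma^2$, $d$ and $n$ are hypothetical.

```python
def minimax_lower_bound(sigma2, d, n):
    """Lower bound sigma^2 * d / (n - d + 1) quoted in the abstract."""
    return sigma2 * d / (n - d + 1)


def ols_risk_gaussian(sigma2, d, n):
    """Expected excess risk of OLS for Gaussian covariates.

    Standard formula assumed for this sketch (not taken from the abstract);
    it is only valid for n > d + 1.
    """
    if n <= d + 1:
        raise ValueError("the formula requires n > d + 1")
    return sigma2 * d / (n - d - 1)


def classical_rate(sigma2, d, n):
    """First-order rate sigma^2 * d / n."""
    return sigma2 * d / n


# Hypothetical example values.
sigma2, d, n = 1.0, 10, 100
rate = classical_rate(sigma2, d, n)
lower = minimax_lower_bound(sigma2, d, n)
ols = ols_risk_gaussian(sigma2, d, n)
print(rate, lower, ols)
```

With these values the three quantities are ordered as rate < lower bound < OLS risk, and the gap between the last two comes only from the denominators $n-d+1$ and $n-d-1$.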
The 10 Algorithms Data Scientist must have to Know.
Let's say I am given an Excel sheet with data about various fruits and I have to tell which ones look like apples. What I will do is ask the question "Which fruits are red and round?" and split the fruits into those that answer yes and those that answer no. Now, not all red and round fruits are apples, and not all apples are red and round. So I will ask "Which fruits have red or yellow color hints on them?" of the red and round fruits, and "Which fruits are green and round?" of the fruits that are not red and round. Based on these questions I can tell with considerable accuracy which are apples. This cascade of questions is what a decision tree is. However, this is a decision tree based on my intuition.
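Written as code, that cascade of questions is just a pair of nested checks. The sketch below is only an illustration: the field names, the sample fruits, and the assumption that a "yes" to the second-level question means "apple" are all made up for this example.

```python
def looks_like_apple(fruit):
    """Hand-written cascade of yes/no questions from the fruit example."""
    if fruit["color"] == "red" and fruit["shape"] == "round":
        # Second question, asked only of the red and round fruits.
        return fruit["red_or_yellow_hints"]
    # Second question, asked only of the fruits that are not red and round.
    return fruit["color"] == "green" and fruit["shape"] == "round"


# Hypothetical rows of the spreadsheet.
fruits = [
    {"name": "red apple", "color": "red", "shape": "round", "red_or_yellow_hints": True},
    {"name": "tomato", "color": "red", "shape": "round", "red_or_yellow_hints": False},
    {"name": "green apple", "color": "green", "shape": "round", "red_or_yellow_hints": False},
    {"name": "banana", "color": "yellow", "shape": "long", "red_or_yellow_hints": True},
]

apples = [fruit["name"] for fruit in fruits if looks_like_apple(fruit)]
print(apples)
```

Each `if` corresponds to one internal node of the tree, and each `return` to a leaf.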
Shear Stress Distribution Prediction in Symmetric Compound Channels Using Data Mining and Machine Learning Models
Khozani, Zohreh Sheikh, Khosravi, Khabat, Torabi, Mohammadamin, Mosavi, Amir, Rezaei, Bahram, Rabczuk, Timon
Shear stress distribution prediction in open channels is of utmost importance in hydraulic structural engineering, as it directly affects the design of stable channels. In this study, a series of experimental tests was first conducted to assess the shear stress distribution in prismatic compound channels. The shear stress values around the whole wetted perimeter were measured in compound channels with different floodplain widths and at different flow depths, in subcritical and supercritical conditions. A set of data mining and machine learning models, including Random Forest (RF), M5P, Random Committee (RC), KStar and Additive Regression Model (AR), was then applied to the collected data to predict the shear stress distribution in the compound channel. Among these five models, the RF method gave the most precise results, with the highest $R^2$ value of 0.9. Finally, the best-performing data mining method in this study (RF) was compared with two well-known analytical models, the Shiono and Knight Method (SKM) and the Shannon method, to assess how well the proposed model predicts the shear stress distribution. The results showed that the RF model has the best prediction performance compared to the SKM and Shannon models.
Cyanure: An Open-Source Toolbox for Empirical Risk Minimization for Python, C++, and soon more
Cyanure is an open-source C++ software package with a Python interface. The goal of Cyanure is to provide state-of-the-art solvers for learning linear models, based on stochastic variance-reduced optimization with acceleration mechanisms. It provides a simple Python API, which is very close to that of scikit-learn and which should be extended to other languages such as R or Matlab in the near future. Cyanure is distributed under the BSD-3-Clause license. Even though this is not legally binding, the author kindly asks users to cite the present arXiv document in their publications, as well as the publication related to the algorithm they have chosen (see Section 4 for the related publications).
A Smooth Introduction to Linear Regression and its Implementation in PyTorch (Part-II)
So in Part-I I gave a simple introduction to what linear regression is and how we can find the equation of the best-fit line for our data. In this post, I will show you how to implement the task we worked on in Part-I in PyTorch. The input size is set to 1 since the inputs to our model are scalars h (hour in the day). The output size is also set to 1 since we get only one value back for r (number of pages being read). So, basically, we will let our program find the best values for B_0 and B_1 that we calculated in the previous part of this tutorial.
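A minimal sketch of that setup might look like the following. It is not the post's own code: the synthetic data (generated from assumed values B_0 = 2 and B_1 = 3), the rescaling of h to the range [0, 1], the learning rate and the number of training steps are all assumptions made so that the example runs on its own.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data: r = B_0 + B_1 * h with B_0 = 2, B_1 = 3, plus a little noise.
h = torch.linspace(0, 1, 50).unsqueeze(1)        # inputs, shape (50, 1)
r = 2.0 + 3.0 * h + 0.01 * torch.randn_like(h)   # targets, shape (50, 1)

# One input feature (h) and one output value (r).
model = nn.Linear(in_features=1, out_features=1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2000):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(h), r)    # mean squared error of the current line
    loss.backward()                # compute gradients w.r.t. weight and bias
    optimizer.step()               # gradient descent update

b0 = model.bias.item()     # learned intercept, plays the role of B_0
b1 = model.weight.item()   # learned slope, plays the role of B_1
print(f"B_0 ~ {b0:.3f}, B_1 ~ {b1:.3f}")
```

The layer's bias and weight are the two numbers being learned, so after training they should land close to the values used to generate the data.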
Bayesian high-dimensional linear regression with generic spike-and-slab priors
Spike-and-slab priors are popular Bayesian solutions for high-dimensional linear regression problems. Previous works on the theoretical properties of spike-and-slab methods focus on specific prior formulations and use prior-dependent conditions and analyses, and thus cannot be generalized directly. In this paper, we propose a class of generic spike-and-slab priors and develop a unified framework to rigorously assess their theoretical properties. Technically, we provide general conditions under which generic spike-and-slab priors can achieve a nearly-optimal posterior contraction rate and model selection consistency. Our results include those of Castillo et al. (2015) and Narisetty and He (2014) as special cases.