Regression
Step-By-Step: Getting Started with Azure Machine Learning
Artificial Intelligence (AI) study and use is on the rise. Tools to enable AI are becoming more readily available, simpler to use and easier to implement. What's more is that the definition of AI itself has been broken down into ingredients that, when later applied into a recipe (or process), can provide multiple desired outcomes. One of the more important ingredients used in most recipes is Machine Learning. Machine Learning in essence is a way of teaching computers to provide more accurate predictions on provided data.
Sparse Regression and Adaptive Feature Generation for the Discovery of Dynamical Systems
We study the performance of sparse regression methods and propose new techniques to distill the governing equations of dynamical systems from data. We first look at the generic methodology of learning interpretable equation forms from data, proposed by Brunton et al., followed by performance of LASSO for this purpose. We then propose a new algorithm that uses the dual of LASSO optimization for higher accuracy and stability. In the second part, we propose a novel algorithm that learns the candidate function library in a completely data-driven manner to distill the governing equations of the dynamical system. This is achieved via sequentially thresholded ridge regression (STRidge) over a orthogonal polynomial space. The performance of the three discussed methods is illustrated by looking the Lorenz 63 system and the quadratic Lorenz system.
Consistent Risk Estimation in High-Dimensional Linear Regression
Xu, Ji, Maleki, Arian, Rad, Kamiar Rahnama
Risk estimation is at the core of many learning systems. The importance of this problem has motivated researchers to propose different schemes, such as cross validation, generalized cross validation, and Bootstrap. The theoretical properties of such estimates have been extensively studied in the low-dimensional settings, where the number of predictors $p$ is much smaller than the number of observations $n$. However, a unifying methodology accompanied with a rigorous theory is lacking in high-dimensional settings. This paper studies the problem of risk estimation under the high-dimensional asymptotic setting $n,p \rightarrow \infty$ and $n/p \rightarrow \delta$ ($\delta$ is a fixed number), and proves the consistency of three risk estimates that have been successful in numerical studies, i.e., leave-one-out cross validation (LOOCV), approximate leave-one-out (ALO), and approximate message passing (AMP)-based techniques. A corner stone of our analysis is a bound that we obtain on the discrepancy of the `residuals' obtained from AMP and LOOCV. This connection not only enables us to obtain a more refined information on the estimates of AMP, ALO, and LOOCV, but also offers an upper bound on the convergence rate of each estimate.
iFair: Learning Individually Fair Data Representations for Algorithmic Decision Making
Lahoti, Preethi, Gummadi, Krishna P., Weikum, Gerhard
People are rated and ranked, towards algorithmic decision making in an increasing number of applications, typically based on machine learning. Research on how to incorporate fairness into such tasks has prevalently pursued the paradigm of group fairness: giving adequate success rates to specifically protected groups. In contrast, the alternative paradigm of individual fairness has received relatively little attention, and this paper advances this less explored direction. The paper introduces a method for probabilistically mapping user records into a low-rank representation that reconciles individual fairness and the utility of classifiers and rankings in downstream applications. Our notion of individual fairness requires that users who are similar in all task-relevant attributes such as job qualification, and disregarding all potentially discriminating attributes such as gender, should have similar outcomes. We demonstrate the versatility of our method by applying it to classification and learning-to-rank tasks on a variety of real-world datasets. Our experiments show substantial improvements over the best prior work for this setting.
Robust Regression via Online Feature Selection under Adversarial Data Corruption
Zhang, Xuchao, Lei, Shuo, Zhao, Liang, Boedihardjo, Arnold P., Lu, Chang-Tien
The presence of data corruption in user-generated streaming data, such as social media, motivates a new fundamental problem that learns reliable regression coefficient when features are not accessible entirely at one time. Until now, several important challenges still cannot be handled concurrently: 1) corrupted data estimation when only partial features are accessible; 2) online feature selection when data contains adversarial corruption; and 3) scaling to a massive dataset. This paper proposes a novel RObust regression algorithm via Online Feature Selection (\textit{RoOFS}) that concurrently addresses all the above challenges. Specifically, the algorithm iteratively updates the regression coefficients and the uncorrupted set via a robust online feature substitution method. We also prove that our algorithm has a restricted error bound compared to the optimal solution. Extensive empirical experiments in both synthetic and real-world datasets demonstrated that the effectiveness of our new method is superior to that of existing methods in the recovery of both feature selection and regression coefficients, with very competitive efficiency.
How It Feels to Learn Data Science in 2019 โ Towards Data Science
So I just have to buy a Tableau license and I'm now a data scientist? Okay, let's just take that sales pitch with a grain of salt. I may be clueless, but I know there is more to data science than making pretty visualizations. I can even do that in Excel. You got to admit it is slick marketing though. Charting data is the fun stage, and they leave out the painful and time-consuming parts of working with data: cleaning, wrangling, transforming, and loading it. God help you if you need to write a specialized algorithm with your own domain logic when using closed tools. Yes, and that is why I suspect there is value in learning to code. Maybe you can learn Alteryx.
New Risk Bounds for 2D Total Variation Denoising
Chatterjee, Sabyasachi, Goswami, Subhajit
2D Total Variation Denoising (TVD) is a widely used technique for image denoising. It is also an important non parametric regression method for estimating functions with heterogenous smoothness. Recent results have shown the TVD estimator to be nearly minimax rate optimal for the class of functions with bounded variation. In this paper, we complement these worst case guarantees by investigating the adaptivity of the TVD estimator to functions which are piecewise constant on axis aligned rectangles. We rigorously show that, when the truth is piecewise constant, the ideally tuned TVD estimator performs better than in the worst case. We also study the issue of choosing the tuning parameter. In particular, we propose a fully data driven version of the TVD estimator which enjoys similar worst case risk guarantees as the ideally tuned TVD estimator.
10 Statistical Techniques Data Scientists Should Master AISOMA AG Frankfurt
The more statistical techniques a Data Scientist has mastered, the better the results can be. In this blog article, we want to introduce you to ten common techniques that should not be missing in the repertoire of a data scientist. In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
Prediction of Industrial Process Parameters using Artificial Intelligence Algorithms
Khdoudi, Abdelmoula, Masrour, Tawfik
In the present paper, a method of defining the industrial process parameters for a new product using machine learning algorithms will be presented. The study will describe how to go from the product characteristics till the prediction of the suitable machine parameters to produce a good quality of this product, and this is based on an historical training dataset of similar products with their respective process parameters. In the first part of our study, we will focus on the ultrasonic welding process definition, welding parameters and on how it operate. While in second part, we present the design and implementation of the prediction models such multiple linear regression, support vector regression, and we compare them to an artificial neural networks algorithm. In the following part, we present a new application of Convolutional Neural Networks (CNN) to the industrial process parameters prediction. In addition, we will propose the generalization approach of our CNN to any prediction problem of industrial process parameters. Finally the results of the four methods will be interpreted and discussed.