Regression
Ultrahigh dimensional instrument detection using graph learning: an application to high dimensional GIS-census data for house pricing
Xu, Ning, Fisher, Timothy C. G., Hong, Jian
The exogeneity bias and instrument validation have always been critical topics in statistics, machine learning and biostatistics. In the era of big data, such issues typically come with dimensionality issue and, hence, require even more attention than ever. In this paper we ensemble two well-known tools from machine learning and biostatistics -- stable variable selection and random graph -- and apply them to estimating the house pricing mechanics and the follow-up socio-economic effect on the 2010 Sydney house data. The estimation is conducted on an over-200-gigabyte ultrahigh dimensional database consisting of local education data, GIS information, census data, house transaction and other socio-economic records. The technique ensemble carefully improves the variable selection sparisty, stability and robustness to high dimensionality, complicated causal structures and the consequent multicollinearity, which is ultimately helpful on the data-driven recovery of a sparse and intuitive causal structure. The new ensemble also reveals its efficiency and effectiveness on endogeneity detection, instrument validation, weak instruments pruning and selection of proper instruments. From the perspective of machine learning, the estimation result both aligns with and confirms the facts of Sydney house market, the classical economic theories and the previous findings of simultaneous equations modeling. Moreover, the estimation result is totally consistent with and supported by the classical econometric tool like two-stage least square regression and different instrument tests (the code can be found at https://github.com/isaac2math/solar_graph_learning).
Solar: a least-angle regression for accurate and stable variable selection in high-dimensional data
Xu, Ning, Fisher, Timothy C. G., Hong, Jian
We propose a new least-angle regression algorithm for variable selection in high-dimensional data, called \emph{subsample-ordered least-angle regression (solar)}. Solar relies on the average $L_0$ solution path computed across subsamples and largely alleviates several known high-dimensional issues with least-angle regression. Using examples based on directed acyclic graphs, we illustrate the advantages of solar in comparison to least-angle regression, forward regression and variable screening. Simulations demonstrate that, with a similar computation load, solar yields substantial improvements over two lasso solvers (least-angle regression for lasso and coordinate-descent) in terms of the sparsity (37-64\% reduction in the average number of selected variables), stability and accuracy of variable selection. Simulations also demonstrate that solar enhances the robustness of variable selection to different settings of the irrepresentable condition and to variations in the dependence structures assumed in regression analysis. We provide a Python package \texttt{solarpy} for the algorithm.
Random Forests for dependent data
Saha, Arkajyoti, Basu, Sumanta, Datta, Abhirup
Random forest (RF) is one of the most popular methods for estimating regression functions. The local nature of the RF algorithm, based on intra-node means and variances, is ideal when errors are i.i.d. For dependent error processes like time series and spatial settings where data in all the nodes will be correlated, operating locally ignores this dependence. Also, RF will involve resampling of correlated data, violating the principles of bootstrap. Theoretically, consistency of RF has been established for i.i.d. errors, but little is known about the case of dependent errors. We propose RF-GLS, a novel extension of RF for dependent error processes in the same way Generalized Least Squares (GLS) fundamentally extends Ordinary Least Squares (OLS) for linear models under dependence. The key to this extension is the equivalent representation of the local decision-making in a regression tree as a global OLS optimization which is then replaced with a GLS loss to create a GLS-style regression tree. This also synergistically addresses the resampling issue, as the use of GLS loss amounts to resampling uncorrelated contrasts (pre-whitened data) instead of the correlated data. For spatial settings, RF-GLS can be used in conjunction with Gaussian Process correlated errors to generate kriging predictions at new locations. RF becomes a special case of RF-GLS with an identity working covariance matrix. We establish consistency of RF-GLS under beta- (absolutely regular) mixing error processes and show that this general result subsumes important cases like autoregressive time series and spatial Matern Gaussian Processes. As a byproduct, we also establish consistency of RF for beta-mixing processes, which to our knowledge, is the first such result for RF under dependence. We empirically demonstrate the improvement achieved by RF-GLS over RF for both estimation and prediction under dependence.
Learning from Data to Speed-up Sorted Table Search Procedures: Methodology and Practical Guidelines
Amato, Domenico, Bosco, Giosuรฉ Lo, Giancarlo, Raffaele
Sorted Table Search Procedures are the quintessential query-answering tool, with widespread usage that now includes also Web Applications, e.g, Search Engines (Google Chrome) and ad Bidding Systems (AppNexus). Speeding them up, at very little cost in space, is still a quite significant achievement. Here we study to what extend Machine Learning Techniques can contribute to obtain such a speed-up via a systematic experimental comparison of known efficient implementations of Sorted Table Search procedures, with different Data Layouts, and their Learned counterparts developed here. We characterize the scenarios in which those latter can be profitably used with respect to the former, accounting for both CPU and GPU computing. Our approach contributes also to the study of Learned Data Structures, a recent proposal to improve the time/space performance of fundamental Data Structures, e.g., B-trees, Hash Tables, Bloom Filters. Indeed, we also formalize an Algorithmic Paradigm of Learned Dichotomic Sorted Table Search procedures that naturally complements the Learned one proposed here and that characterizes most of the known Sorted Table Search Procedures as having a "learning phase" that approximates Simple Linear Regression.
Logistic Regression from scratch
You can find working code examples (including this one) in my lab repository on GitHub. Sometimes it's necessary to split existing data into several classes in order to predict new, unseen data. This problem is called classification and one of the algorithms which can be used to learn those classes from data is called Logistic Regression. In this article we'll take a deep dive into the Logistic Regression model to learn how it differs from other regression models such as Linear- or Multiple Linear Regression, how to think about it from an intuitive perspective and how we can translate our learnings into code while implementing it from scratch. If you've read the post about Linear- and Multiple Linear Regression you might remember that the main objective of our algorithm was to find a best fitting line or hyperplane respectively.
Prediction of hierarchical time series using structured regularization and its application to artificial neural networks
Shiratori, Tomokaze, Kobayashi, Ken, Takano, Yuichi
This paper discusses the prediction of hierarchical time series, where each upper-level time series is calculated by summing appropriate lower-level time series. Forecasts for such hierarchical time series should be coherent, meaning that the forecast for an upper-level time series equals the sum of forecasts for corresponding lower-level time series. Previous methods for making coherent forecasts consist of two phases: first computing base (incoherent) forecasts and then reconciling those forecasts based on their inherent hierarchical structure. With the aim of improving time series predictions, we propose a structured regularization method for completing both phases simultaneously. The proposed method is based on a prediction model for bottom-level time series and uses a structured regularization term to incorporate upper-level forecasts into the prediction model. We also develop a backpropagation algorithm specialized for application of our method to artificial neural networks for time series prediction. Experimental results using synthetic and real-world datasets demonstrate the superiority of our method in terms of prediction accuracy and computational efficiency.
Advantages and Disadvantages of Linear Regression
Linear regression is a simple Supervised Learning algorithm that is used to predict the value of a dependent variable(y) for a given value of the independent variable(x) by effectively modelling a linear relationship(of the form: y mx c) between the input(x) and output(y) variables using the given dataset. In this article we will be discussing the advantages and disadvantages of linear regression. Linear Regression is a very simple algorithm that can be implemented very easily to give satisfactory results.Furthermore, these models can be trained easily and efficiently even on systems with relatively low computational power when compared to other complex algorithms.Linear regression has a considerably lower time complexity when compared to some of the other machine learning algorithms.The mathematical equations of Linear regression are also fairly easy to understand and interpret.Hence Linear regression is very easy to master. Linear regression fits linearly seperable datasets almost perfectly and is often used to find the nature of the relationship between variables. Overfitting is a situation that arises when a machine learning model fits a dataset very closely and hence captures the noisy data as well.This negatively impacts the performance of model and reduces its accuracy on the test set.
Ensuring the Robustness and Reliability of Data-Driven Knowledge Discovery Models in Production and Manufacturing
Tripathi, Shailesh, Muhr, David, Manuel, Brunner, Emmert-Streib, Frank, Jodlbauer, Herbert, Dehmer, Matthias
The implementation of robust, stable, and user-centered data analytics and machine learning models is confronted by numerous challenges in production and manufacturing. Therefore, a systematic approach is required to develop, evaluate, and deploy such models. The data-driven knowledge discovery framework provides an orderly partition of the data-mining processes to ensure the practical implementation of data analytics and machine learning models. However, the practical application of robust industry-specific data-driven knowledge discovery models faces multiple data-- and model-development--related issues. These issues should be carefully addressed by allowing a flexible, customized, and industry-specific knowledge discovery framework; in our case, this takes the form of the cross-industry standard process for data mining (CRISP-DM). This framework is designed to ensure active cooperation between different phases to adequately address data- and model-related issues. In this paper, we review several extensions of CRISP-DM models and various data-robustness-- and model-robustness--related problems in machine learning, which currently lacks proper cooperation between data experts and business experts because of the limitations of data-driven knowledge discovery models.
Supervised Machine Learning Techniques: An Overview with Applications to Banking
Hu, Linwei, Chen, Jie, Vaughan, Joel, Yang, Hanyu, Wang, Kelly, Sudjianto, Agus, Nair, Vijayan N.
This article provides an overview of Supervised Machine Learning (SML) with a focus on applications to banking. The SML techniques covered include Bagging (Random Forest or RF), Boosting (Gradient Boosting Machine or GBM) and Neural Networks (NNs). We begin with an introduction to ML tasks and techniques. This is followed by a description of: i) tree-based ensemble algorithms including Bagging with RF and Boosting with GBMs, ii) Feedforward NNs, iii) a discussion of hyper-parameter optimization techniques, and iv) machine learning interpretability. The paper concludes with a comparison of the features of different ML algorithms. Examples taken from credit risk modeling in banking are used throughout the paper to illustrate the techniques and interpret the results of the algorithms.
Truncated Linear Regression in High Dimensions
Daskalakis, Constantinos, Rohatgi, Dhruv, Zampetakis, Manolis
As in standard linear regression, in truncated linear regression, we are given access to observations $(A_i, y_i)_i$ whose dependent variable equals $y_i= A_i^{\rm T} \cdot x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest and $\eta_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some "truncation set" $S \subset \mathbb{R}$. The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log n)/m})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) [Daskalakis et al. 2018], where no regularization is needed due to the low-dimensionality of the data, and (2) [Wainright 2009], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high-dimensionality at the same time, we develop new techniques that not only generalize the existing ones but we believe are of independent interest.