Regression
Data-driven Localization and Estimation of Disturbance in the Interconnected Power System
Lee, Hyang-Won, Zhang, Jianan, Modiano, Eytan
Identifying the location of a disturbance and its magnitude is an important component for stable operation of power systems. We study the problem of localizing and estimating a disturbance in the interconnected power system. We take a model-free approach to this problem by using frequency data from generators. Specifically, we develop a logistic regression based method for localization and a linear regression based method for estimation of the magnitude of disturbance. Our model-free approach does not require the knowledge of system parameters such as inertia constants and topology, and is shown to achieve highly accurate localization and estimation performance even in the presence of measurement noise and missing data.
Similarity encoding for learning with dirty categorical variables
Cerda, Patricio, Varoquaux, Gaรซl, Kรฉgl, Balรกzs
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
Linear Regression
Predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, data scientists could use predictive models to forecast crop yields based on rainfall and temperature, or to determine whether patients with certain traits are more likely to react badly to a new medication. Before we talk about linear regression specifically, let's remind ourselves what a typical data science workflow might look like. A lot of the time, we'll start with a question we want to answer, and do something like the following: Linear regression is one of the simplest and most common supervised machine learning algorithms that data scientists use for predictive modeling. In this post, we'll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure. This post is part of our focus on nature data this month.
Causal Inference with Noisy and Missing Covariates via Matrix Factorization
Kallus, Nathan, Mao, Xiaojie, Udell, Madeleine
Valid causal inference in observational studies often requires controlling for confounders. However, in practice measurements of confounders may be noisy, and can lead to biased estimates of causal effects. We show that we can reduce the bias caused by measurement noise using a large number of noisy measurements of the underlying confounders. We propose the use of matrix factorization to infer the confounders from noisy covariates, a flexible and principled framework that adapts to missing values, accommodates a wide variety of data types, and can augment many causal inference methods. We bound the error for the induced average treatment effect estimator and show it is consistent in a linear regression setting, using Exponential Family Matrix Completion preprocessing. We demonstrate the effectiveness of the proposed procedure in numerical experiments with both synthetic data and real clinical data.
Structural Learning of Multivariate Regression Chain Graphs via Decomposition
Javidian, Mohammad Ali, Valtorta, Marco
We extend the decomposition approach for learning Bayesian networks (BN) proposed by (Xie et al., 2006) to learning multivariate regression chain graphs (MVR CGs), which include BNs as a special case. The same advantages of this decomposition approach hold in the more general setting: reduces complexity and increased power of computational independence tests. Moreover, latent (hidden) variables can be represented in MVR CGs by using bidirected edges, and our algorithm correctly recovers any independence structure that is faithful to a MVR CG, thus greatly extending the range of applications of decomposition-based model selection techniques. While our new algorithm has the same complexity as the one in (Xie et al., 2006) for BNs, it requires larger components for general MVR CGs, to insure that sufficient data is present to estimate parameters.
Analysis of regularized Nystr\"om subsampling for regression functions of low smoothness
Lu, Shuai, Mathรฉ, Peter, Pereverzyev, Sergiy Jr
This paper studies a Nystr\"om type subsampling approach to large kernel learning methods in the misspecified case, where the target function is not assumed to belong to the reproducing kernel Hilbert space generated by the underlying kernel. This case is less understood, in spite of its practical importance. To model such a case, the smoothness of target functions is described in terms of general source conditions. It is surprising that almost for the whole range of the source conditions, describing the misspecified case, the corresponding learning rate bounds can be achieved with just one value of the regularization parameter. This observation allows a formulation of mild conditions under which the plain Nystr\"om subsampling can be realized with subquadratic cost maintaining the guaranteed learning rates.
The Logistic Regression Algorithm
Logistic Regression is one of the most used Machine Learning algorithms for binary classification. It is a simple Algorithm that you can use as a performance baseline, it is easy to implement and it will do well enough in many tasks. Therefore every Machine Learning engineer should be familiar with its concepts. The building block concepts of Logistic Regression can also be helpful in deep learning while building neural networks. In this post, you will learn what Logistic Regression is, how it works, what are advantages and disadvantages and much more.
Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP)
Hu, Linwei, Chen, Jie, Nair, Vijayan N., Sudjianto, Agus
Supervised Machine Learning (SML) algorithms such as Gradient Boosting, Random Forest, and Neural Networks have become popular in recent years due to their increased predictive performance over traditional statistical methods. This is especially true with large data sets (millions or more observations and hundreds to thousands of predictors). However, the complexity of the SML models makes them opaque and hard to interpret without additional tools. There has been a lot of interest recently in developing global and local diagnostics for interpreting and explaining SML models. In this paper, we propose locally interpretable models and effects based on supervised partitioning (trees) referred to as LIME-SUP. This is in contrast with the KLIME approach that is based on clustering the predictor space. We describe LIME-SUP based on fitting trees to the fitted response (LIM-SUP-R) as well as the derivatives of the fitted response (LIME-SUP-D). We compare the results with KLIME and describe its advantages using simulation and real data.
Signal and Noise Statistics Oblivious Orthogonal Matching Pursuit
Kallummil, Sreejith, Kalyani, Sheetal
Orthogonal matching pursuit (OMP) is a widely used algorithm for recovering sparse high dimensional vectors in linear regression models. The optimal performance of OMP requires \textit{a priori} knowledge of either the sparsity of regression vector or noise statistics. Both these statistics are rarely known \textit{a priori} and are very difficult to estimate. In this paper, we present a novel technique called residual ratio thresholding (RRT) to operate OMP without any \textit{a priori} knowledge of sparsity and noise statistics and establish finite sample and large sample support recovery guarantees for the same. Both analytical results and numerical simulations in real and synthetic data sets indicate that RRT has a performance comparable to OMP with \textit{a priori} knowledge of sparsity and noise statistics.
Using Linear Regression for Predictive Modeling in R
Predictive models are extremely useful for forecasting future outcomes and estimating metrics that are impractical to measure. For example, data scientists could use predictive models to forecast crop yields based on rainfall and temperature, or to determine whether patients with certain traits are more likely to react badly to a new medication. Before we talk about linear regression specifically, let's remind ourselves what a typical data science workflow might look like. A lot of the time, we'll start with a question we want to answer, and do something like the following: Linear regression is one of the simplest and most common supervised machine learning algorithms that data scientists use for predictive modeling. In this post, we'll use linear regression to build a model that predicts cherry tree volume from metrics that are much easier for folks who study trees to measure.