 Regression


Predictive Complexity Priors

arXiv.org Machine Learning

Specifying a Bayesian prior is notoriously difficult for complex models such as neural networks. Reasoning about parameters is made challenging by the high-dimensionality and over-parameterization of the space. Priors that seem benign and uninformative can have unintuitive and detrimental effects on a model's predictions. For this reason, we propose predictive complexity priors: a functional prior that is defined by comparing the model's predictions to those of a reference model. Although originally defined on the model outputs, we transfer the prior to the model parameters via a change of variables. The traditional Bayesian workflow can then proceed as usual. We apply our predictive complexity prior to high-dimensional regression, reasoning over neural network depth, and sharing of statistical strength for few-shot learning.


Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees

arXiv.org Machine Learning

(Gradient) Expectation Maximization (EM) is a widely used algorithm for maximum likelihood estimation in mixture models and incomplete data problems. A major challenge facing this popular technique is how to effectively preserve the privacy of sensitive data. Previous research on this problem has already led to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. However, unlike in the non-private case, existing techniques are not yet able to provide finite sample statistical guarantees. To address this issue, we propose in this paper the first DP version of the (Gradient) EM algorithm with statistical guarantees. Moreover, we apply our general framework to three canonical models: Gaussian Mixture Model (GMM), Mixture of Regressions Model (MRM) and Linear Regression with Missing Covariates (RMC). Specifically, for GMM in the DP model, our estimation error is near optimal in some cases. For the other two models, we provide the first finite sample statistical guarantees. Our theory is supported by thorough numerical experiments.


Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control

arXiv.org Machine Learning

Gaussian graphical models (GGMs), which represent the dependence structure among a set of random variables, have been widely used to model the conditional dependence relationships in many applications, including gene regulatory networks and brain connectivity maps (Drton and Maathuis, 2017; Varoquaux et al., 2010; Zhao et al., 2014; Glymour et al., 2019). In the classical setting with data from a single study, the estimation of high-dimensional GGMs has been well studied in a series of papers, including penalized likelihood methods (Yuan and Lin, 2007; Lam and Fan, 2009; Friedman et al., 2008; Rothman et al., 2008) and convex optimization based methods (Cai et al., 2011, 2016; Liu and Wang, 2017). The minimax optimal rates are studied in Cai et al. (2016) and a review can be found in Cai (2017). Liu (2013) considers the inference in GGMs based on a node-wise regression approach and Ren et al. (2015) studies the estimation optimality and inference for individual entries. Methods for estimating a single GGM have also been extended to simultaneously estimating multiple graphs when data from multiple studies are available. For example, Guo et al. (2011); Danaher et al. (2014); Cai et al. (2016) consider jointly estimating multiple GGMs with some penalties for inducing common structures among different graphs.


Conditional Density Estimation via Weighted Logistic Regressions

arXiv.org Machine Learning

Compared to the conditional mean as a simple point estimator, the conditional density function is more informative for describing distributions with multi-modality, asymmetry or heteroskedasticity. In this paper, we propose a novel parametric conditional density estimation method by showing the connection between the general density and the likelihood function of inhomogeneous Poisson process models. The maximum likelihood estimates can be obtained via weighted logistic regressions, and the computational burden can be significantly reduced by combining a block-wise alternating maximization scheme and local case-control sampling. We also provide simulation studies for illustration.


JSRT: James-Stein Regression Tree

arXiv.org Machine Learning

Regression trees (RTs) have been widely used in the machine learning and data mining communities. Given target data for prediction, a regression tree is first constructed from a training dataset, and a prediction is then made at each leaf node. In practice, the performance of an RT relies heavily on the local mean of the samples in an individual node during the tree construction/prediction stage, while neglecting the global information from different nodes, which also plays an important role. To address this issue, we propose a novel regression tree, named James-Stein Regression Tree (JSRT), that considers global information from different nodes. Specifically, we incorporate global mean information from different nodes, based on the James-Stein estimator, during the construction/prediction stage. In addition, we analyze the generalization error of our method under the mean squared error (MSE) metric. Extensive experiments on public benchmark datasets verify the effectiveness and efficiency of our method, and demonstrate the superiority of our method over other RT prediction methods.
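As a rough illustration of the idea of pulling each node's local mean toward a global mean, the sketch below applies the textbook positive-part James-Stein estimator to a vector of leaf means. It is not the JSRT algorithm itself: the function name `james_stein_shrink`, the assumption of a single known variance shared by all leaf means, and the choice of the grand mean as the shrinkage target are all assumptions made for this example.

```python
import numpy as np

def james_stein_shrink(leaf_means, variance):
    """Shrink leaf means toward their grand mean (positive-part James-Stein).

    Assumes (hypothetically) that every leaf mean has the same known
    sampling variance `variance`. Requires at least 4 leaves.
    """
    leaf_means = np.asarray(leaf_means, dtype=float)
    k = leaf_means.size
    if k < 4:
        raise ValueError("shrinkage toward the grand mean needs at least 4 means")

    grand_mean = leaf_means.mean()
    deviations = leaf_means - grand_mean
    spread = float(deviations @ deviations)  # sum of squared deviations
    if spread == 0.0:
        return leaf_means.copy()  # all means already equal

    # Shrinkage factor, clipped at zero (the "positive-part" variant).
    factor = max(0.0, 1.0 - (k - 3) * variance / spread)
    return grand_mean + factor * deviations

shrunk = james_stein_shrink([1.0, 2.0, 3.0, 4.0, 5.0], variance=1.0)
```

With noisier leaf means (a larger `variance`), the factor drops toward zero and every leaf prediction collapses to the grand mean; with very precise leaf means, the local means are left almost unchanged.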


Gaussian Gated Linear Networks

arXiv.org Machine Learning

We propose the Gaussian Gated Linear Network (G-GLN), an extension to the recently proposed GLN family of deep neural networks. Instead of using backpropagation to learn features, GLNs have a distributed and local credit assignment mechanism based on optimizing a convex objective. This gives rise to many desirable properties including universality, data-efficient online learning, trivial interpretability and robustness to catastrophic forgetting. We extend the GLN framework from classification to multiple regression and density modelling by generalizing geometric mixing to a product of Gaussian densities. The G-GLN achieves competitive or state-of-the-art performance on several univariate and multivariate regression benchmarks, and we demonstrate its applicability to practical tasks including online contextual bandits and density estimation via denoising.
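The phrase "a product of Gaussian densities" can be made concrete with a small sketch. A weighted product of univariate Gaussian densities is, up to normalization, again Gaussian, with a precision equal to the weighted sum of the input precisions. The function name `weighted_gaussian_product` and the example inputs are assumptions for illustration; this shows only the closed-form mixing step, not the gating or the learning rule of the G-GLN.

```python
import numpy as np

def weighted_gaussian_product(means, variances, weights):
    """Mean and variance of the Gaussian proportional to prod_i N(mu_i, var_i)**w_i.

    Assumes the weighted sum of precisions is positive, so the result
    is a proper density.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)

    precision = np.sum(weights / variances)
    if precision <= 0.0:
        raise ValueError("weighted precision must be positive")

    mean = np.sum(weights * means / variances) / precision
    return float(mean), float(1.0 / precision)

# Two equally weighted unit-variance experts centred at 0 and 2.
mixed_mean, mixed_variance = weighted_gaussian_product(
    means=[0.0, 2.0], variances=[1.0, 1.0], weights=[1.0, 1.0]
)
```

In this example the combined density is centred midway between the two experts and is narrower than either, since their precisions add.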


Distributed Learning of Finite Gaussian Mixtures

arXiv.org Machine Learning

Advances in information technology have led to extremely large datasets that are often kept in different storage centers. Existing statistical methods must be adapted to overcome the resulting computational obstacles while retaining statistical validity and efficiency. Split-and-conquer approaches have been applied in many areas, including quantile processes, regression analysis, principal eigenspaces, and exponential families. We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures. We recommend a reduction strategy and develop an effective MM algorithm. The new estimator is shown to be consistent and retains root-n consistency under some general conditions. Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has comparable statistical performance with the global estimator based on the full dataset, if the latter is feasible. It can even slightly outperform the global estimator if the model assumption does not match the real-world data. It also has better statistical and computational performance than some existing methods.


On the Adversarial Robustness of LASSO Based Feature Selection

arXiv.org Machine Learning

In this paper, we investigate the adversarial robustness of feature selection based on the $\ell_1$ regularized linear regression model, namely LASSO. In the considered model, there is a malicious adversary who can observe the whole dataset, and then will carefully modify the response values or the feature matrix in order to manipulate the selected features. We formulate the modification strategy of the adversary as a bi-level optimization problem. Because the $\ell_1$ norm is non-differentiable at zero, we reformulate the $\ell_1$ norm regularizer as linear inequality constraints. We employ the interior-point method to solve this reformulated LASSO problem and obtain the gradient information. Then we use the projected gradient descent method to design the modification strategy. In addition, we demonstrate that this method can be extended to other $\ell_1$ based feature selection methods, such as group LASSO and sparse group LASSO. Numerical examples with synthetic and real data illustrate that our method is efficient and effective.
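One standard way to express an $\ell_1$ penalty through linear inequality constraints is shown below; the paper's exact formulation may differ. The symbols are assumptions introduced for this sketch: $X$ is the feature matrix, $y$ the response vector, $\beta \in \mathbb{R}^p$ the coefficients, $\lambda > 0$ the regularization weight, and $t \in \mathbb{R}^p$ auxiliary variables that bound $|\beta_j|$.

```latex
\begin{aligned}
\min_{\beta,\,t}\quad & \tfrac{1}{2}\lVert y - X\beta\rVert_2^2 + \lambda \sum_{j=1}^{p} t_j \\
\text{s.t.}\quad & -t_j \le \beta_j \le t_j, \qquad j = 1,\dots,p.
\end{aligned}
```

At the optimum $t_j = |\beta_j|$, so the problem has the same minimizers in $\beta$ as LASSO, but the objective is now smooth and the non-differentiability is moved into linear constraints that an interior-point method can handle.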


Time Series Extrinsic Regression

arXiv.org Machine Learning

This paper studies Time Series Extrinsic Regression (TSER): a regression task of which the aim is to learn the relationship between a time series and a continuous scalar variable; a task closely related to time series classification (TSC), which aims to learn the relationship between a time series and a categorical class label. This task generalizes time series forecasting (TSF), relaxing the requirement that the value predicted be a future value of the input series or primarily depend on more recent values. In this paper, we motivate and study this task, and benchmark existing solutions and adaptations of TSC algorithms on a novel archive of 19 TSER datasets which we have assembled. Our results show that the state-of-the-art TSC algorithm Rocket, when adapted for regression, achieves the highest overall accuracy compared to adaptations of other TSC algorithms and state-of-the-art machine learning (ML) algorithms such as XGBoost, Random Forest and Support Vector Regression. More importantly, we show that much research is needed in this field to improve the accuracy of ML models. We also find evidence that further research has excellent prospects of improving upon these straightforward baselines.


How to Explain Key Machine Learning Algorithms at an Interview - KDnuggets

#artificialintelligence

Linear Regression involves finding a 'line of best fit' that represents a dataset using the least squares method. The least squares method involves finding a linear equation that minimizes the sum of squared residuals. A residual is equal to the actual value minus the predicted value. To give an example, the red line is a better line of best fit than the green line because it is closer to the points, and thus, the residuals are smaller. Ridge regression, also known as L2 Regularization, is a regression technique that introduces a small amount of bias to reduce overfitting.
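To make the two definitions concrete, here is a minimal NumPy sketch that fits a line by least squares and by ridge regression on a small made-up dataset. The function names, the data, and the penalty strength `alpha=5.0` are assumptions for illustration, and the ridge variant shown penalizes only the slope, leaving the intercept unpenalized.

```python
import numpy as np

def fit_least_squares(x, y):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0]), float(coef[1])

def fit_ridge(x, y, alpha):
    """Return (slope, intercept) for ridge regression with L2 penalty alpha.

    Centering x and y keeps the intercept out of the penalty.
    """
    x_mean, y_mean = x.mean(), y.mean()
    xc, yc = x - x_mean, y - y_mean
    slope = float(xc @ yc) / (float(xc @ xc) + alpha)
    return slope, float(y_mean - slope * x_mean)

def sum_squared_residuals(x, y, slope, intercept):
    """Residual = actual value minus predicted value."""
    residuals = y - (slope * x + intercept)
    return float(residuals @ residuals)

# Made-up example data, roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

ols_slope, ols_intercept = fit_least_squares(x, y)
ridge_slope, ridge_intercept = fit_ridge(x, y, alpha=5.0)
```

The ridge slope is pulled toward zero relative to the least squares slope, so its sum of squared residuals on the training data is larger; that is the bias the technique accepts in exchange for less overfitting.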