Goto

Collaborating Authors

 Regression


Benign Overfitting in Linear Regression

arXiv.org Machine Learning

The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of gaussian linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.


A comparison of apartment rent price prediction using a large dataset: Kriging versus DNN

arXiv.org Machine Learning

The hedonic approach based on a regression model has been widely adopted for the prediction of real estate property price and rent. In particular, a spatial regression technique called Kriging, a method of interpolation that was advanced in the field of spatial statistics, are known to enable high accuracy prediction in light of the spatial dependence of real estate property data. Meanwhile, there has been a rapid increase in machine learning-based prediction using a large (big) dataset and its effectiveness has been demonstrated in previous studies. However, no studies have ever shown the extent to which predictive accuracy differs for Kriging and machine learning techniques using big data. Thus, this study compares the predictive accuracy of apartment rent price in Japan between the nearest neighbor Gaussian processes (NNGP) model, which enables application of Kriging to big data, and the deep neural network (DNN), a representative machine learning technique, with a particular focus on the data sample size (n = 10^4, 10^5, 10^6) and differences in predictive performance. Our analysis showed that, with an increase in sample size, the out-of-sample predictive accuracy of DNN approached that of NNGP and they were nearly equal on the order of n = 10^6. Furthermore, it is suggested that, for both higher and lower end properties whose rent price deviates from the median, DNN may have a higher predictive accuracy than that of NNGP.


AMF: Aggregated Mondrian Forests for Online Learning

arXiv.org Machine Learning

Introduced by Breiman (2001), Random Forests (RF) is one of the algorithms of choice in many supervised learning applications. The appeal of these methods comes from their remarkable accuracy in a variety of tasks, the small number (or even the absence) of parameters to tune, their reasonable computational cost at training and prediction time, and their suitability in highdimensional settings. Most commonly used RF algorithms, such as the original random forest procedure (Breiman, 2001), extra-trees (Geurts et al., 2006), or conditional inference forest (Hothorn et al., 2010) are batch algorithms, that require the whole dataset to be available at once. Several online random forests variants have been proposed to overcome this issue and handle data that come sequentially. Utgoff (1989) was the first to extend Quinlan's ID3 batch decision tree algorithm (see Quinlan, 1986) to an online setting. Later on, Domingos and Hulten (2000) introduce Hoeffding Trees that can be easily updated: since observations are available sequentially, a cell is split when (i) enough observations have fallen into this cell, (ii) the best split in the cell is statistically relevant (a generic Hoeffding inequality being used to assess the quality of the best split). Since random forests are known to exhibit better empirical performances than individual decision trees, online random forests have been proposed (see, e.g., Saffari et al., 2009; Denil et al., 2013).


A Review of Statistical Learning Machines from ATR to DNA Microarrays: design, assessment, and advice for practitioners

arXiv.org Machine Learning

Statistical Learning is the process of estimating an unknown probabilistic input-output relationship of a system using a limited number of observations; and a statistical learning machine (SLM) is the machine that learned such a process. While their roots grow deeply in Probability Theory, SLMs are ubiquitous in the modern world. Automatic Target Recognition (ATR) in military applications, Computer Aided Diagnosis (CAD) in medical imaging, DNA microarrays in Genomics, Optical Character Recognition (OCR), Speech Recognition (SR), spam email filtering, stock market prediction, etc., are few examples and applications for SLM; diverse fields but one theory. The field of Statistical Learning can be decomposed to two basic subfields, Design and Assessment. Three main groups of specializations-namely statisticians, engineers, and computer scientists (ordered ascendingly by programming capabilities and descendingly by mathematical rigor)-exist on the venue of this field and each takes its elephant bite. Exaggerated rigorous analysis of statisticians sometimes deprives them from considering new ML techniques and methods that, yet, have no "complete" mathematical theory. On the other hand, immoderate add-hoc simulations of computer scientists sometimes derive them towards unjustified and immature results. A prudent approach is needed that has the enough flexibility to utilize simulations and trials and errors without sacrificing any rigor. If this prudent attitude is necessary for this field it is necessary, as well, in other fields of Engineering.


Policy Targeting under Network Interference

arXiv.org Machine Learning

The empirical analysis of experiments and quasi-experiments often seeks to determine the optimal allocation of treatments that maximizes social welfare. In the presence of interference, spillover effects lead to a new formulation of the statistical treatment choice problem. This paper develops a novel method to construct individual-specific optimal treatment allocation rules under network interference. Several features make the proposed methodology particularly appealing for applications: we construct targeting rules that depend on an arbitrary set of individual, neighbors' and network characteristics, and we allow for general constraints on the policy function; we consider heterogeneous direct and spillover effects, arbitrary, possibly non-linear, regression models, and we propose estimators that are robust to model misspecification; the method flexibly accommodates for cases where researchers only observe local information of the network. From a theoretical perspective, we establish the first set of guarantees on the utilitarian regret under interference, and we show that it achieves the min-max optimal rate in scenarios of practical and theoretical interest. We discuss the empirical performance in simulations and we illustrate our method by investigating the role of social networks in micro-finance decisions.


The NN-Stacking: Feature weighted linear stacking through neural networks

arXiv.org Machine Learning

Stacking methods improve the prediction performance of regression models. A simple way to stack base regressions estimators is by combining them linearly, as done by \citet{breiman1996stacked}. Even though this approach is useful from an interpretative perspective, it often does not lead to high predictive power. We propose the NN-Stacking method (NNS), which generalizes Breiman's method by allowing the linear parameters to vary with input features. This improvement enables NNS to take advantage of the fact that distinct base models often perform better at different regions of the feature space. Our method uses neural networks to estimate the stacking coefficients. We show that while our approach keeps the interpretative features of Breiman's method at a local level, it leads to better predictive power, especially in datasets with large sample sizes.


The Value of Collaboration in Convex Machine Learning with Differential Privacy

arXiv.org Machine Learning

In this paper, we apply machine learning to distributed private data owned by multiple data owners, entities with access to non-overlapping training datasets. We use noisy, differentially-private gradients to minimize the fitness cost of the machine learning model using stochastic gradient descent. We quantify the quality of the trained model, using the fitness cost, as a function of privacy budget and size of the distributed datasets to capture the trade-off between privacy and utility in machine learning. This way, we can predict the outcome of collaboration among privacy-aware data owners prior to executing potentially computationally-expensive machine learning algorithms. Particularly, we show that the difference between the fitness of the trained machine learning model using differentially-private gradient queries and the fitness of the trained machine model in the absence of any privacy concerns is inversely proportional to the size of the training datasets squared and the privacy budget squared. We successfully validate the performance prediction with the actual performance of the proposed privacy-aware learning algorithms, applied to: financial datasets for determining interest rates of loans using regression; and detecting credit card frauds using support vector machines.


Asymmetric Random Projections

arXiv.org Machine Learning

Random projections (RP) are a popular tool for reducing dimensionality while preserving local geometry. In many applications the data set to be projected is given to us in advance, yet the current RP techniques do not make use of information about the data. In this paper, we provide a computationally light way to extract statistics from the data that allows designing a data dependent RP with superior performance compared to data-oblivious RP. We tackle scenarios such as matrix multiplication and linear regression/classification in which we wish to estimate inner products between pairs of vectors from two possibly different sources. Our technique takes advantage of the difference between the sources and is provably superior to oblivious RPs. Additionally, we provide extensive experiments comparing RPs with our approach showing significant performance lifts in fast matrix multiplication, regression and classification problems.


Distilling BERT -- How to achieve BERT performance using logistic regression

#artificialintelligence

BERT is awesome, and it's everywhere. It looks like any NLP task can benefit from utilizing BERT. The authors showed that this is indeed the case, and from my experience, it works like magic. It's easy to use, works on a small amount of data and supports many different languages. It seems like there's no single reason not to use it everywhere.


Max-Affine Regression: Provable, Tractable, and Near-Optimal Statistical Estimation

arXiv.org Machine Learning

Max-affine regression refers to a model where the unknown regression function is modeled as a maximum of $k$ unknown affine functions for a fixed $k \geq 1$. This generalizes linear regression and (real) phase retrieval, and is closely related to convex regression. Working within a non-asymptotic framework, we study this problem in the high-dimensional setting assuming that $k$ is a fixed constant, and focus on estimation of the unknown coefficients of the affine functions underlying the model. We analyze a natural alternating minimization (AM) algorithm for the non-convex least squares objective when the design is random. We show that the AM algorithm, when initialized suitably, converges with high probability and at a geometric rate to a small ball around the optimal coefficients. In order to initialize the algorithm, we propose and analyze a combination of a spectral method and a random search scheme in a low-dimensional space, which may be of independent interest. The final rate that we obtain is near-parametric and minimax optimal (up to a poly-logarithmic factor) as a function of the dimension, sample size, and noise variance. In that sense, our approach should be viewed as a direct and implementable method of enforcing regularization to alleviate the curse of dimensionality in problems of the convex regression type. As a by-product of our analysis, we also obtain guarantees on a classical algorithm for the phase retrieval problem under considerably weaker assumptions on the design distribution than was previously known. Numerical experiments illustrate the sharpness of our bounds in the various problem parameters.