AITopics | Regression

Regression

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

Freund, Robert M., Grigas, Paul, Mazumder, Rahul

arXiv.org Machine Learning, May-16-2015

In this paper we analyze boosting algorithms in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm (FS$_\varepsilon$) and least squares boosting (LS-Boost($\varepsilon$)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a modification of FS$_\varepsilon$ that yields an algorithm for the Lasso, and that may be easily extended to an algorithm that computes the Lasso path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the Lasso may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-Boost($\varepsilon$) and FS$_\varepsilon$) by using techniques of modern first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
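The abstract names FS$_\varepsilon$ without spelling out its update rule. The block below is a minimal sketch that assumes the usual textbook statement of incremental forward stagewise regression: at each step, pick the feature most correlated with the current residual and move its coefficient by $\varepsilon$ in the direction of that correlation. The function names `forward_stagewise` and `max_abs_correlation` are illustrative and not taken from the paper.

```python
def forward_stagewise(X, y, eps=0.01, n_iter=1000):
    """Incremental forward stagewise regression (FS_eps), plain-Python sketch.

    X is a list of n rows with p entries each; y is a list of n responses.
    Returns (beta, residual) after n_iter steps of size eps.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)  # the residual equals y while beta is all zeros
    for _ in range(n_iter):
        # Inner product of every feature column with the current residual.
        corr = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        # Feature with the largest absolute correlation.
        j = max(range(p), key=lambda k: abs(corr[k]))
        s = 1.0 if corr[j] >= 0 else -1.0
        # Move that coefficient by eps and update the residual to match.
        beta[j] += eps * s
        for i in range(n):
            r[i] -= eps * s * X[i][j]
    return beta, r


def max_abs_correlation(X, r):
    """The loss discussed in the abstract: max_j |X_j^T r|."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * r[i] for i in range(n))) for j in range(p))
```

Tracking `max_abs_correlation(X, r)` across iterations is one way to observe the quantity that, according to the abstract, the algorithm drives down by subgradient descent.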

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv: 1505.04243

Country: North America > United States > Massachusetts > Middlesex County (0.28)

Genre: Research Report > New Finding (0.92)

Industry: Health & Medicine (0.94)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

An Analysis of Active Learning With Uniform Feature Noise

Ramdas, Aaditya, Poczos, Barnabas, Singh, Aarti, Wasserman, Larry

arXiv.org Machine Learning, May-15-2015

In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as "errors in variables" and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $\sigma$. Our simple but revealing setting is one-dimensional binary classification, where the goal is to learn a threshold (the point where the probability of a $+$ label crosses half). We deal with regression functions that are antisymmetric in a region of size $\sigma$ around the threshold and also satisfy Tsybakov's margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $\sigma$ is smaller than the minimax active/passive noiseless error derived in \cite{CN07}, then noise has no effect on the rates and one achieves the same noiseless rates. For larger $\sigma$, the \textit{unflattening} of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise \textit{appears} to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.

artificial intelligence, machine learning, noise, (17 more...)

arXiv: 1505.04215

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)

Compound Poisson Processes, Latent Shrinkage Priors and Bayesian Nonconvex Penalization

Zhang, Zhihua, Li, Jin

arXiv.org Machine Learning, May-15-2015

In this paper we discuss Bayesian nonconvex penalization for sparse learning problems. We explore a nonparametric formulation for latent shrinkage parameters using subordinators, which are one-dimensional Lévy processes. In particular, we study a family of continuous compound Poisson subordinators and a family of discrete compound Poisson subordinators. We exemplify four specific subordinators: Gamma, Poisson, negative binomial and squared Bessel subordinators. The Laplace exponents of the subordinators are Bernstein functions, so they can be used as sparsity-inducing nonconvex penalty functions. We exploit these subordinators in regression problems, yielding a hierarchical model with multiple regularization parameters. We devise ECME (Expectation/Conditional Maximization Either) algorithms to simultaneously estimate regression coefficients and regularization parameters. Empirical evaluation on simulated data shows that our approach is feasible and effective in high-dimensional data analysis.

artificial intelligence, bayesian inference, machine learning, (15 more...)

doi: 10.1214/14-BA892

arXiv: 1308.6069

Country: Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Optimal linear estimation under unknown nonlinear transform

Yi, Xinyang, Wang, Zhaoran, Caramanis, Constantine, Liu, Han

arXiv.org Machine Learning, May-13-2015

Linear regression studies the problem of estimating a model parameter $\beta^* \in \mathbb{R}^p$ from $n$ observations $\{(y_i,\mathbf{x}_i)\}_{i=1}^n$ drawn from the linear model $y_i = \langle \mathbf{x}_i,\beta^* \rangle + \epsilon_i$. We consider a significant generalization in which the relationship between $\langle \mathbf{x}_i,\beta^* \rangle$ and $y_i$ is noisy, quantized to a single bit, potentially nonlinear, noninvertible, as well as unknown. This model is known as the single-index model in statistics, and, among other things, it represents a significant generalization of one-bit compressed sensing. We propose a novel spectral-based estimation procedure and show that we can recover $\beta^*$ in settings (i.e., classes of link function $f$) where previous algorithms fail. In general, our algorithm requires only very mild restrictions on the (unknown) functional relationship between $y_i$ and $\langle \mathbf{x}_i,\beta^* \rangle$. We also consider the high dimensional setting where $\beta^*$ is sparse, and introduce a two-stage nonconvex framework that addresses estimation challenges in high dimensional regimes where $p \gg n$. For a broad class of link functions between $\langle \mathbf{x}_i,\beta^* \rangle$ and $y_i$, we establish minimax lower bounds that demonstrate the optimality of our estimators in both the classical and high dimensional regimes.

artificial intelligence, machine learning, regression, (18 more...)

arXiv: 1505.03257

Country: North America > United States > Texas (0.28)

Genre: Research Report (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

DART: Dropouts meet Multiple Additive Regression Trees

Rashmi, K. V., Gilad-Bachrach, Ran

arXiv.org Machine Learning, May-7-2015

Multiple Additive Regression Trees (MART), an ensemble model of boosted regression trees, is known to deliver high prediction accuracy for diverse tasks, and it is widely used in practice. However, it suffers from an issue which we call over-specialization, wherein trees added at later iterations tend to impact the prediction of only a few instances, and make negligible contribution towards the remaining instances. This negatively affects the performance of the model on unseen data, and also makes the model over-sensitive to the contributions of the few, initially added trees. We show that the commonly used tool to address this issue, that of shrinkage, alleviates the problem only to a certain extent and the fundamental issue of over-specialization still remains. In this work, we explore a different approach to address the problem: employing dropouts, a tool that has been recently proposed in the context of learning deep neural networks. We propose a novel way of employing dropouts in MART, resulting in the DART algorithm. We evaluate DART on ranking, regression and classification tasks, using large scale, publicly available datasets, and show that DART outperforms MART in each of the tasks, with a significant margin. We also show that DART overcomes the issue of over-specialization to a considerable extent.
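The abstract does not give the dropout mechanics, so the following is only a rough sketch under stated assumptions: squared loss, one-dimensional inputs, regression stumps standing in for full trees, and the rescaling rule as commonly described for DART (with $k$ trees dropped, the new tree gets weight $1/(k+1)$ and each dropped tree is scaled by $k/(k+1)$). All function names and the `drop_rate` parameter are hypothetical.

```python
import random


def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs: (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant fit
        m = sum(targets) / len(targets)
        return (xs[0], m, m)
    return best[1:]


def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def predict(trees, weights, x):
    """Weighted sum of all stumps in the ensemble."""
    return sum(w * stump_predict(tr, x) for tr, w in zip(trees, weights))


def dart_fit(xs, ys, n_trees=20, drop_rate=0.1, seed=0):
    """Boosting with dropout; drop_rate=0 reduces to plain boosting without shrinkage."""
    rng = random.Random(seed)
    trees, weights = [], []
    for _ in range(n_trees):
        # Drop each existing tree independently with probability drop_rate.
        dropped = [i for i in range(len(trees)) if rng.random() < drop_rate]
        kept = [i for i in range(len(trees)) if i not in dropped]
        # Residuals of the ensemble evaluated WITHOUT the dropped trees.
        residuals = [
            y - sum(weights[i] * stump_predict(trees[i], x) for i in kept)
            for x, y in zip(xs, ys)
        ]
        new_tree = fit_stump(xs, residuals)
        k = len(dropped)
        # Rescale so the new tree plus the dropped trees do not overshoot.
        for i in dropped:
            weights[i] *= k / (k + 1)
        trees.append(new_tree)
        weights.append(1.0 / (k + 1))
    return trees, weights
```

Because the new tree is fitted against an ensemble that is missing some members, it cannot simply patch the leftovers of earlier trees, which is the intuition behind the reduction in over-specialization reported in the abstract.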

artificial intelligence, ensemble, machine learning, (20 more...)

arXiv: 1505.01866

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.85)

On the Feasibility of Distributed Kernel Regression for Big Data

Xu, Chen, Zhang, Yongquan, Li, Runze

arXiv.org Machine Learning, May-4-2015

In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To facilitate the computational process, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, a full dataset is first split into several manageable segments; the final output is then averaged from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributive strategy provides valid theoretical inferences about the original data. In this paper, we address this fundamental issue for distributed kernel regression (DKR), where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the big data setup. Under mild conditions, we show that, with a proper number of segments, DKR leads to an estimator that is generalization consistent to the unknown regression function. The obtained results justify the method of DKR and shed light on the feasibility of using other distributed algorithms for processing big data. The promising performance of the method is supported by both simulation and real data examples.
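The split, fit and average scheme described above can be sketched as follows. This is an illustration rather than the authors' implementation: it assumes kernel ridge regression with a Gaussian kernel as the per-segment estimator, scalar inputs, and an interleaved split; the names `fit_krr`, `fit_dkr`, and the parameters `h` (bandwidth) and `lam` (ridge penalty) are hypothetical.

```python
import math


def gaussian_kernel(a, b, h):
    return math.exp(-((a - b) ** 2) / (2.0 * h * h))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def fit_krr(xs, ys, h, lam):
    """Kernel ridge regression on one segment; returns a predictor function."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], h) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, ys)  # alpha = (K + lam * I)^{-1} y
    return lambda x: sum(a * gaussian_kernel(x, xi, h) for a, xi in zip(alpha, xs))


def fit_dkr(xs, ys, n_segments, h=0.2, lam=1e-3):
    """Split the data into segments, fit each one, and average the predictors."""
    models = [fit_krr(xs[s::n_segments], ys[s::n_segments], h, lam)
              for s in range(n_segments)]
    return lambda x: sum(m(x) for m in models) / len(models)
```

Each segment only needs to solve a linear system of its own size, which is where the computational saving of the divide-and-conquer scheme comes from; whether the averaged predictor still generalizes well is the question the paper addresses.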

data mining, dkr, machine learning, (17 more...)

arXiv: 1505.00869

Country:

Asia (0.68)
North America > United States > Pennsylvania (0.28)

Genre: Research Report (0.83)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)

Regularization-free estimation in trace regression with symmetric positive semidefinite matrices

Slawski, Martin, Li, Ping, Hein, Matthias

arXiv.org Machine Learning, Apr-23-2015

Over the past few years, trace regression models have received considerable attention in the context of matrix completion, quantum state tomography, and compressed sensing. Estimation of the underlying matrix via regularization-based approaches promoting low-rankedness, notably nuclear norm regularization, has enjoyed great popularity. In the present paper, we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite (\textsf{spd}) and the design satisfies certain conditions. In this situation, simple least squares estimation subject to an \textsf{spd} constraint may perform as well as regularization-based approaches with a proper choice of the regularization parameter, which entails knowledge of the noise level and/or tuning. By contrast, constrained least squares estimation comes without any tuning parameter and may hence be preferred due to its simplicity.

artificial intelligence, machine learning, matrix, (17 more...)

arXiv: 1504.06305

Country:

North America > United States (0.46)
Europe > Germany (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

On the relation between Gaussian process quadratures and sigma-point methods

Särkkä, Simo, Hartikainen, Jouni, Svensson, Lennart, Sandblom, Fredrik

arXiv.org Machine Learning, Apr-22-2015

This article is concerned with Gaussian process quadratures, which are numerical integration methods based on Gaussian process regression methods, and sigma-point methods, which are used in advanced non-linear Kalman filtering and smoothing algorithms. We show that many sigma-point methods can be interpreted as Gaussian process quadrature based methods with suitably selected covariance functions. We show that this interpretation also extends to more general multivariate Gauss--Hermite integration methods and related spherical cubature rules. Additionally, we discuss different criteria for selecting the sigma-point locations: exactness for multivariate polynomials up to a given order, minimum average error, and quasi-random point sets. The performance of the different methods is tested in numerical experiments.
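As a concrete instance of a sigma-point rule, the sketch below uses the scalar unscented transform, which approximates a Gaussian expectation by a weighted sum over three deterministically placed points. It is an illustration only and not code from the article; the function name and the default `kappa` are assumptions. With `kappa = 2` in one dimension the rule coincides with three-point Gauss-Hermite quadrature, so it is exact for polynomials up to degree 5.

```python
import math


def unscented_expectation(g, mean, var, kappa=2.0):
    """Approximate E[g(x)] for scalar x ~ N(mean, var) with three sigma points."""
    n = 1  # state dimension (scalar case only in this sketch)
    spread = math.sqrt((n + kappa) * var)
    # One point at the mean and one on each side of it.
    points = [mean, mean + spread, mean - spread]
    # Weights sum to one: kappa/(n+kappa) for the centre, 1/(2(n+kappa)) for each side.
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return sum(w * g(x) for w, x in zip(weights, points))
```

For example, `unscented_expectation(lambda x: x ** 4, 0.0, 1.0)` reproduces the fourth moment of a standard normal up to floating-point rounding, an instance of the "exactness for polynomials up to a given order" criterion mentioned in the abstract.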

artificial intelligence, covariance function, machine learning, (14 more...)

arXiv: 1504.05994

Country:

Europe > Finland (0.28)
North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Modeling & Simulation (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Spectral Norm of Random Kernel Matrices with Applications to Privacy

Kasiviswanathan, Shiva Prasad, Rudelson, Mark

arXiv.org Machine Learning, Apr-22-2015

Kernel methods are an extremely popular set of techniques used for many important machine learning and data analysis applications. In addition to having good practical performance, these methods are supported by a well-developed theory. Kernel methods use an implicit mapping of the input data into a high dimensional feature space defined by a kernel function, i.e., a function returning the inner product between the images of two data points in the feature space. Central to any kernel method is the kernel matrix, which is built by evaluating the kernel function on a given sample dataset. In this paper, we initiate the study of non-asymptotic spectral theory of random kernel matrices. These are $n \times n$ random matrices whose $(i,j)$th entry is obtained by evaluating the kernel function on $x_i$ and $x_j$, where $x_1,...,x_n$ are a set of $n$ independent random high-dimensional vectors. Our main contribution is to obtain tight upper bounds on the spectral norm (largest eigenvalue) of random kernel matrices constructed by commonly used kernel functions based on polynomials and Gaussian radial basis. As an application of these results, we provide lower bounds on the distortion needed for releasing the coefficients of kernel ridge regression under attribute privacy, a general privacy notion which captures a large class of privacy definitions. Kernel ridge regression is a standard method for performing non-parametric regression that regularly outperforms traditional regression approaches in various domains. Our privacy distortion lower bounds are the first for any kernel technique, and our analysis assumes realistic scenarios for the input, unlike all previous lower bounds for other release problems which only hold under very restrictive input settings.
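To make the central object concrete, the sketch below builds a Gaussian radial basis kernel matrix from a list of vectors and estimates its spectral norm by power iteration. It is a small illustration under assumptions (the bandwidth `sigma`, the iteration count and the function names are all hypothetical) and has no connection to the proof techniques of the paper.

```python
import math


def gaussian_kernel_matrix(points, sigma=1.0):
    """n x n matrix whose (i, j) entry is exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    def k(a, b):
        sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return [[k(a, b) for b in points] for a in points]


def spectral_norm_psd(K, n_iter=200):
    """Largest eigenvalue of a symmetric PSD matrix, estimated by power iteration."""
    n = len(K)
    v = [1.0 / math.sqrt(n)] * n  # unit-norm starting vector
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(wi * vi for wi, vi in zip(w, v))  # Rayleigh quotient v^T K v
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return lam
```

Since every diagonal entry of a Gaussian kernel matrix equals 1 and the matrix is positive semidefinite, its spectral norm always lies between 1 and $n$; the bounds in the paper describe where in that range it falls for random high-dimensional inputs.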

artificial intelligence, machine learning, matrix, (19 more...)

arXiv: 1504.0588

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.89)

High-Dimensional Classification for Brain Decoding

Croteau, Nicole, Nathoo, Farouk S., Cao, Jiguo, Budney, Ryan

arXiv.org Machine Learning, Apr-10-2015

Brain decoding involves the determination of a subject's cognitive state or an associated stimulus from functional neuroimaging data measuring brain activity. In this setting the cognitive state is typically characterized by an element of a finite set, and the neuroimaging data comprise voluminous amounts of spatiotemporal data measuring some aspect of the neural signal. The associated statistical problem is one of classification from high-dimensional data. We explore the use of functional principal component analysis, mutual information networks, and persistent homology for examining the data through exploratory analysis and for constructing features characterizing the neural signal for brain decoding. We review each approach from this perspective, and we incorporate the features into a classifier based on symmetric multinomial logistic regression with elastic net regularization. The approaches are illustrated in an application where the task is to infer, from brain activity measured with magnetoencephalography (MEG), the type of video stimulus shown to a subject.

artificial intelligence, classifier, machine learning, (18 more...)

arXiv: 1504.028

Country: North America (0.28)

Genre:

Research Report > Experimental Study (0.49)
Research Report > New Finding (0.35)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)
