AITopics

1312.1666

Country: Europe > United Kingdom (0.28)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Rosasco, Lorenzo, Villa, Silvia

Learning with incremental iterative regularization

Machine learning applications often require efficient statistical procedures to process potentially massive amount of high dimensional data. Motivated by such applications, the broad objective of our study is deriving learning procedures with optimal statistical properties, and, at the same time, computational complexities proportional to the generalization properties allowed by the data, rather than their raw amount [5]. In this paper, we focus on iterative regularization as a viable approach towards this goal. The key observation behind these techniques is that iterative optimization schemes applied to scattered, noisy data exhibit a self-regularizing property, in the sense that early termination (early-stop) of the iterative process has a regularizing effect [19, 22]. Indeed, iterative regularization algorithms are classical in inverse problems [14], and have been recently considered in machine learning [6, 32, 2, 4, 8, 24], where they have been proved to achieve optimal learning bounds, matching those of variational regularization schemes such as Tikhonov [7, 29]. 1 In this paper, we consider an iterative regularization algorithm for the square loss, based on a recursive procedure updating the solution after processing one training set point at each iteration. Methods of the latter form, often broadly referred to as online learning algorithms, have become standard in the processing of large data-sets, because of their often low iteration cost and good practical performance. Theoretical studies for this class of algorithms have been developed within different frameworks.

artificial intelligence, iteration, machine learning, (16 more...)

1405.0042

Country: North America > United States > Massachusetts (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Chwialkowski, Kacper, Ramdas, Aaditya, Sejdinovic, Dino, Gretton, Arthur

Fast Two-Sample Testing with Analytic Representations of Probability Measures

We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.

artificial intelligence, characteristic function, machine learning, (17 more...)

1506.04725

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Alquier, Pierre, Ridgway, James, Chopin, Nicolas

On the properties of variational approximations of Gibbs posteriors

The PAC-Bayesian approach is a powerful set of techniques to derive non- asymptotic risk bounds for random estimators. The corresponding optimal distribution of estimators, usually called the Gibbs posterior, is unfortunately intractable. One may sample from it using Markov chain Monte Carlo, but this is often too slow for big datasets. We consider instead variational approximations of the Gibbs posterior, which are fast to compute. We undertake a general study of the properties of such approximations. Our main finding is that such a variational approximation has often the same rate of convergence as the original PAC-Bayesian procedure it approximates. We specialise our results to several learning tasks (classification, ranking, matrix completion),discuss how to implement a variational approximation in each case, and illustrate the good properties of said approximation on real datasets.

approximation, artificial intelligence, machine learning, (19 more...)

1506.04091

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Laviolette, François, Morvant, Emilie, Ralaivola, Liva, Roy, Jean-Francis

On the Generalization of the C-Bound to Structured Output Ensemble Methods

It is well-known that learning predictive models capable of dealing with outputs that are richer than binary outputs (e.g., multiclass or multilabel) and for which theoretical guarantees exist is still a realm of intensive investigations. From a practical standpoint, a lot of relaxations for learning with complex outputs have been devised. A common approach consists in decomposing the output space into "simpler" spaces so that the learning problem at hand can be reduced to a few easier (i.e., binary) learning tasks. For instance, this is the idea spurred by the Error-Correcting Output Codes (Dietterich & Bakiri, 1995) that makes possible to reduce multiclass or multilabel problems into binary classification tasks,e.g., (Allwein et al., 2001; Mroueh et al., 2012; Read et al., 2011; Tsoumakas & Vlahavas, 2007; Zhang & Schneider, 2012). In our work, we study the problem of complex output prediction by focusing on prediction functions that take the form of a weighted majority vote over a set of complex output classifiers (or voters). Recall that ensemble methods can all be seen as majority vote learning procedures (Dietterich, 2000; Re & Valentini, 2012). Methods such as Bagging (Breiman, 1996), Boosting (Schapire & Singer, 1999) and Random Forests (Breiman, 2001) are representative voting methods. Cortes et al. (2014) have proposed various ensemble methods for the structured output prediction framework. Note also that majority votes are also central to the Bayesian approach (Gelman et al., 2004) with the notion of Bayesian model averaging (Domingos, 2000; Haussler et al., 1994) and most of kernel-based predictors, such as the Support Vector Machines (Boser et al., 1992; Cortes & Vapnik, 1995) may be viewed as weighted majority votes as well: for binary classification, where the predicted class for some input x is computed as the sign of

artificial intelligence, machine learning, majority vote, (16 more...)

1408.1336

Country: North America (0.28)

Genre: Research Report (0.50)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)

arXiv.org Machine LearningJun-13-2015

A General Framework for Fast Stagewise Algorithms

Tibshirani, Ryan J.

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: it starts with all coefficients equal to zero, and iteratively updates the coefficient (by a small amount $\epsilon$) of the variable that achieves the maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it is known that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size $\epsilon$ goes to zero. Furthermore, essentially the same equivalence holds outside of least squares regression, with the minimization of a differentiable convex loss function subject to an $\ell_1$ norm constraint (the stagewise algorithm now updates the coefficient corresponding to the maximal absolute component of the gradient). Even when they do not match their $\ell_1$-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the $\ell_1$ norm and sparsity? The current paper is an attempt to do just this. We present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.

algorithm, artificial intelligence, machine learning, (17 more...)

1408.5801

Country: North America > United States (0.67)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

Rosenblatt, Jonathan, Nadler, Boaz

On the Optimality of Averaging in Distributed Statistical Learning

arXiv.org Machine LearningJun-13-2015

A common approach to statistical learning with big-data is to randomly split it among $m$ machines and learn the parameter of interest by averaging the $m$ individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: First, a classical setting where the number of parameters $p$ is fixed, and the number of samples per machine $n\to\infty$. Second, a high-dimensional regime where both $p,n\to\infty$ with $p/n \to \kappa \in (0,1)$. For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-$p$ setting, under suitable assumptions, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second order error terms, and show that these can be non-negligible, notably for non-linear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which to leading order increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.

artificial intelligence, assumption, machine learning, (19 more...)

doi: 10.1093/imaiai/iaw013

1407.2724

Country: North America > United States (0.93)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Teh, Yee Whye, Thiéry, Alexandre, Vollmer, Sebastian

Consistency and fluctuations for stochastic gradient Langevin dynamics

arXiv.org Machine LearningJun-12-2015

Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally expensive. Both the calculation of the acceptance probability and the creation of informed proposals usually require an iteration through the whole data set. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem by generating proposals which are only based on a subset of the data, by skipping the accept-reject step and by using decreasing step-sizes sequence $(\delta_m)_{m \geq 0}$. %Under appropriate Lyapunov conditions, We provide in this article a rigorous mathematical framework for analysing this algorithm. We prove that, under verifiable assumptions, the algorithm is consistent, satisfies a central limit theorem (CLT) and its asymptotic bias-variance decomposition can be characterized by an explicit functional of the step-sizes sequence $(\delta_m)_{m \geq 0}$. We leverage this analysis to give practical recommendations for the notoriously difficult tuning of this algorithm: it is asymptotically optimal to use a step-size sequence of the type $\delta_m \asymp m^{-1/3}$, leading to an algorithm whose mean squared error (MSE) decreases at rate $\mathcal{O}(m^{-1/3})$

artificial intelligence, bayesian inference, machine learning, (16 more...)

1409.0578

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

De Myttenaere, Arnaud, Golden, Boris, Grand, Bénédicte Le, Rossi, Fabrice

Using the Mean Absolute Percentage Error for Regression Models

arXiv.org Machine LearningJun-12-2015

We study in this paper the consequences of using the Mean Absolute Percentage Error (MAPE) as a measure of quality for regression models. We show that finding the best model under the MAPE is equivalent to doing weighted Mean Absolute Error (MAE) regression. We show that universal consistency of Empirical Risk Minimization remains possible using the MAPE instead of the MAE.

artificial intelligence, machine learning, mape, (15 more...)

1506.04176

Country: Europe > France (0.16)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.62)

Corneli, Marco, Latouche, Pierre, Rossi, Fabrice

Exact ICL maximization in a non-stationary time extension of the latent block model for dynamic networks

arXiv.org Machine LearningJun-12-2015

The latent block model (LBM) is a flexible probabilistic tool to describe interactions between node sets in bipartite networks, but it does not account for interactions of time varying intensity between nodes in unknown classes. In this paper we propose a non stationary temporal extension of the LBM that clusters simultaneously the two node sets of a bipartite network and constructs classes of time intervals on which interactions are stationary. The number of clusters as well as the membership to classes are obtained by maximizing the exact complete-data integrated likelihood relying on a greedy search approach. Experiments on simulated and real data are carried out in order to assess the proposed methodology.

artificial intelligence, interaction, machine learning, (13 more...)

1506.04138

Country: Europe > Belgium (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.35)