
Collaborating Authors

 Mourtada, Jaouad


Finite-sample performance of the maximum likelihood estimator in logistic regression

arXiv.org Machine Learning

Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second, that of its accuracy when it exists. These properties depend both on the dimension of the covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second, to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
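As a concrete illustration of the setting above (a minimal numerical sketch, not code from the paper; the data-generating parameters below are arbitrary), the MLE can be computed by gradient descent on the logistic loss. On a linearly separated dataset the same iteration would diverge rather than converge, reflecting the non-existence of the MLE.

```python
import numpy as np

def logistic_mle(X, y, lr=0.1, steps=5000):
    """Gradient descent on the (mean) logistic loss.

    Converges to the MLE when it exists, i.e. when the dataset is not
    linearly separated; on separated data the iterates diverge.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # model probabilities P(y=1 | x)
        w -= lr * (X.T @ (p - y)) / n     # gradient step on the mean loss
    return w

# Well-specified logistic model with Gaussian covariates
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_star = np.array([1.0, -2.0, 0.5])       # illustrative "true" parameter
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_star))).astype(float)
w_hat = logistic_mle(X, y)
```

With 200 Gaussian samples in dimension 3 and this signal strength, the data are not separated with overwhelming probability, so the estimate stays finite and points roughly in the direction of the true parameter.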


Local Risk Bounds for Statistical Aggregation

arXiv.org Artificial Intelligence

In the problem of aggregation, the aim is to combine a given class of base predictors to achieve predictions nearly as accurate as the best one. In this flexible framework, no assumption is made on the structure of the class or the nature of the target. Aggregation has been studied in both sequential and statistical contexts. Despite some important differences between the two problems, the classical results in both cases feature the same global complexity measure. In this paper, we revisit and tighten classical results in the theory of aggregation in the statistical setting by replacing the global complexity with a smaller, local one. Some of our proofs build on the PAC-Bayes localization technique introduced by Catoni. Among other results, we prove localized versions of the classical bound for the exponential weights estimator due to Leung and Barron and deviation-optimal bounds for the Q-aggregation estimator. These bounds improve over the results of Dai, Rigollet and Zhang for fixed design regression and the results of Lecué and Rigollet for random design regression.
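For intuition, the exponential weights estimator mentioned above can be sketched in a few lines. This toy version uses the squared loss and a fixed temperature beta, both illustrative choices rather than the paper's exact setting:

```python
import numpy as np

def exponential_weights(preds, y, beta=1.0):
    """Aggregate base predictors with weights exp(-beta * n * empirical risk).

    preds: (M, n) array, predictions of M base predictors on n points.
    Returns the aggregated prediction and the weight vector.
    """
    risks = np.mean((preds - y) ** 2, axis=1)            # empirical risk of each predictor
    w = np.exp(-beta * len(y) * (risks - risks.min()))   # shift by the min for stability
    w /= w.sum()
    return w @ preds, w
```

Predictors whose empirical risk is close to the smallest one dominate the mixture, so the aggregate performs nearly as well as the best predictor in the class.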


Universal coding, intrinsic volumes, and metric complexity

arXiv.org Machine Learning

We study sequential probability assignment in the Gaussian setting, where the goal is to predict, or equivalently compress, a sequence of real-valued observations almost as well as the best Gaussian distribution with mean constrained to a given subset of $\mathbf{R}^n$. First, in the case of a convex constraint set $K$, we express the hardness of the prediction problem (the minimax regret) in terms of the intrinsic volumes of $K$; specifically, it equals the logarithm of the Wills functional from convex geometry. We then establish a comparison inequality for the Wills functional in the general nonconvex case, which underlines the metric nature of this quantity and generalizes the Slepian-Sudakov-Fernique comparison principle for the Gaussian width. Motivated by this inequality, we characterize the exact order of magnitude of the considered functional for a general nonconvex set, in terms of global covering numbers and local Gaussian widths. This implies metric isomorphic estimates for the log-Laplace transform of the intrinsic volume sequence of a convex body. As part of our analysis, we also characterize the minimax redundancy for a general constraint set. We finally relate and contrast our findings with classical asymptotic results in information theory.


Distribution-Free Robust Linear Regression

arXiv.org Machine Learning

We study random design linear regression with no assumptions on the distribution of the covariates and with a heavy-tailed response variable. In this distribution-free regression setting, we show that boundedness of the conditional second moment of the response given the covariates is a necessary and sufficient condition for achieving nontrivial guarantees. As a starting point, we prove an optimal version of the classical in-expectation bound for the truncated least squares estimator due to Györfi, Kohler, Krzyżak, and Walk. However, we show that this procedure fails with constant probability for some distributions despite its optimal in-expectation performance. Then, combining the ideas of truncated least squares, median-of-means procedures, and aggregation theory, we construct a non-linear estimator achieving excess risk of order $d/n$ with an optimal sub-exponential tail. While existing approaches to linear regression for heavy-tailed distributions focus on proper estimators that return linear functions, we highlight that the improperness of our procedure is necessary for attaining nontrivial guarantees in the distribution-free setting.
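The median-of-means idea, one ingredient of the construction above, is simple to sketch in its basic one-dimensional form (an illustrative version, not the paper's estimator):

```python
import numpy as np

def median_of_means(x, k):
    """Split the sample into k blocks, average each block, and return
    the median of the block means. Robust to heavy tails: a handful of
    outliers can only corrupt a handful of blocks."""
    blocks = np.array_split(np.asarray(x, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))
```

On a sample where a few gross outliers ruin the empirical mean, the median of block means remains close to the bulk, which is the mechanism behind the sub-exponential tails mentioned above.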


Regularized ERM on random subspaces

arXiv.org Machine Learning

We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data-dependent subspaces spanned by a random subset of the data, recovering as a special case Nyström approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have recently been explored for the least squares loss and for self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions that might not be smooth, such as the hinge loss used in support vector machines. This extension requires developing new proofs that use different technical tools. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance. Theoretical results are illustrated with simple numerical experiments.
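To make the construction concrete, here is a toy sketch of a random-subspace (Nyström-style) estimator for kernel ridge regression. It uses the squared loss for simplicity, whereas the paper's focus is on non-smooth convex Lipschitz losses such as the hinge loss; the Gaussian kernel and all constants are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr(X, y, m, lam, gamma=1.0, seed=0):
    """Kernel ridge regression restricted to the span of m random landmarks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)  # random subset of the data
    Knm = gaussian_kernel(X, X[idx], gamma)          # (n, m) cross-kernel
    Kmm = gaussian_kernel(X[idx], X[idx], gamma)     # (m, m) landmark kernel
    # Normal equations of ridge regression in the m-dimensional subspace
    a = np.linalg.solve(Knm.T @ Knm + lam * len(X) * Kmm, Knm.T @ y)
    return lambda Z: gaussian_kernel(Z, X[idx], gamma) @ a

rng = np.random.default_rng(0)
X = 3.0 * rng.random((200, 1))
y = np.sin(2.0 * X[:, 0])
f = nystrom_krr(X, y, m=40, lam=1e-3)
```

Solving an m x m system instead of an n x n one is where the computational savings come from; the statistical question studied in the paper is how small m can be before accuracy degrades.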


Asymptotics of Ridge(less) Regression under General Source Condition

arXiv.org Machine Learning

Understanding the generalisation properties of artificial deep neural networks (ANNs) has recently motivated a number of statistical questions. These models perform well in practice despite perfectly fitting (interpolating) the data, a property that seems at odds with classical statistical theory [49]. This has motivated the investigation of the generalisation performance of methods that achieve zero training error (interpolators) [32, 9, 11, 10, 8] and, in the context of linear least squares, of the unique least norm solution to which gradient descent converges [22, 5, 37, 8, 21, 38, 20, 39]. Overparameterized linear models, where the number of variables exceeds the number of points, are arguably the simplest and most natural setting in which interpolation can be studied. Moreover, in certain regimes ANNs can be approximated by suitable linear models [24, 17, 18, 2, 13]. The learning curve (test error versus model capacity) of interpolators has been shown to exhibit a characteristic "Double Descent" [1, 7] shape, where the test error decreases after peaking at the "interpolating" threshold, that is, the model capacity required to interpolate the data. The regime beyond this threshold naturally captures the setting of ANNs [49] and has thus motivated its investigation [36, 44, 39].
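The minimum-norm interpolator at the heart of this line of work is easy to write down explicitly (a self-contained sketch with arbitrary synthetic data, not tied to the paper's source condition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                    # overparameterized: more variables than points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-l2-norm solution among all interpolators; this is also the
# solution gradient descent reaches from zero initialization.
w = np.linalg.pinv(X) @ y
```

"Ridgeless" regression refers to the limit of the ridge solution as the regularization parameter tends to zero, which recovers exactly this minimum-norm interpolator.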


AMF: Aggregated Mondrian Forests for Online Learning

arXiv.org Machine Learning

Introduced by Breiman (2001), Random Forests (RF) are among the algorithms of choice in many supervised learning applications. The appeal of these methods comes from their remarkable accuracy on a variety of tasks, the small number (or even the absence) of parameters to tune, their reasonable computational cost at training and prediction time, and their suitability in high-dimensional settings. Most commonly used RF algorithms, such as the original random forest procedure (Breiman, 2001), extra-trees (Geurts et al., 2006), or conditional inference forests (Hothorn et al., 2010), are batch algorithms that require the whole dataset to be available at once. Several online random forest variants have been proposed to overcome this issue and handle data that arrive sequentially. Utgoff (1989) was the first to extend Quinlan's ID3 batch decision tree algorithm (see Quinlan, 1986) to an online setting. Later on, Domingos and Hulten (2000) introduced Hoeffding Trees, which can be easily updated: since observations arrive sequentially, a cell is split when (i) enough observations have fallen into it, and (ii) the best split in the cell is statistically relevant (a generic Hoeffding inequality being used to assess the quality of the best split). Since random forests are known to exhibit better empirical performance than individual decision trees, online random forests have been proposed (see, e.g., Saffari et al., 2009; Denil et al., 2013).


Anytime Hedge achieves optimal regret in the stochastic regime

arXiv.org Machine Learning

This paper is about a surprising fact: we prove that the anytime Hedge algorithm with decreasing learning rate, one of the simplest algorithms for the problem of prediction with expert advice, is actually both worst-case optimal and adaptive to the easier stochastic and adversarial-with-a-gap regimes. This runs counter to the common belief in the literature that this algorithm is overly conservative and that only new adaptive algorithms can simultaneously achieve minimax regret and adapt to the difficulty of the problem. Moreover, our analysis exhibits qualitative differences with other variants of the Hedge algorithm based on the so-called "doubling trick", as well as with the fixed-horizon version (with constant learning rate).
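The algorithm in question is short enough to state in full. Below is a sketch of anytime Hedge with the decreasing learning rate eta_t = sqrt(ln(M)/t); the simulated losses are an arbitrary illustration of a stochastic regime, not an experiment from the paper.

```python
import numpy as np

def anytime_hedge(losses):
    """Run Hedge with decreasing learning rate eta_t = sqrt(ln(M)/t).

    losses: (T, M) array of expert losses in [0, 1].
    Returns the algorithm's cumulative (expected) loss and the
    cumulative losses of the M experts.
    """
    T, M = losses.shape
    L = np.zeros(M)                          # cumulative expert losses
    total = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(M) / t)
        w = np.exp(-eta * (L - L.min()))     # shift by the min for stability
        w /= w.sum()
        total += w @ losses[t - 1]           # expected loss of the mixture
        L += losses[t - 1]
    return total, L
```

In a stochastic regime where one expert is best by a margin, the regret total - L.min() stays well below the worst-case sqrt(T ln M) rate, which is the adaptivity the paper establishes.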


Minimax optimal rates for Mondrian trees and forests

arXiv.org Machine Learning

Introduced by Breiman (2001), Random Forests are widely used as classification and regression algorithms. While initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the Mondrian Forest, whose trees are built using the so-called Mondrian process, which allows their construction to be easily updated in a streaming fashion. In this paper, we study Mondrian Forests in a batch setting and prove their consistency assuming a proper tuning of the lifetime sequence. A thorough theoretical study of Mondrian partitions allows us to derive an upper bound on the risk of Mondrian Forests, which turns out to match the minimax optimal rate for both Lipschitz and twice differentiable regression functions. These results are the first to show that some particular random forests achieve minimax rates in arbitrary dimension, paving the way to a refined theoretical analysis and thus a deeper understanding of these black-box algorithms.
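The Mondrian process underlying these forests admits a compact recursive description; here is a sketch in one dimension (an illustrative implementation, not the paper's code):

```python
import numpy as np

def mondrian_partition_1d(lo, hi, lifetime, rng, t=0.0):
    """Sample a 1-D Mondrian partition of [lo, hi] with the given lifetime.

    A cell of length L waits an Exp(L)-distributed time, then splits at a
    uniform point; recursion stops once the accumulated time exceeds the
    lifetime, so larger lifetimes yield finer partitions."""
    t += rng.exponential(1.0 / (hi - lo))  # waiting time with rate = cell length
    if t > lifetime:
        return [(lo, hi)]                  # this cell becomes a leaf
    s = rng.uniform(lo, hi)                # uniform split location
    return (mondrian_partition_1d(lo, s, lifetime, rng, t)
            + mondrian_partition_1d(s, hi, lifetime, rng, t))

rng = np.random.default_rng(42)
cells = mondrian_partition_1d(0.0, 1.0, lifetime=5.0, rng=rng)
```

A key property of the Mondrian process is that a partition with a given lifetime can be refined into one with a larger lifetime without restarting, which is what makes the streaming updates mentioned above possible.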


Universal consistency and minimax rates for online Mondrian Forests

Neural Information Processing Systems

We establish the consistency of the Mondrian Forest algorithm [LRT14, LRT16], a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm proposed in [LRT14], which uses a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure.