Collaborating Authors

Adaptive Martingale Boosting

Neural Information Processing Systems

In recent work Long and Servedio LS05short presented a martingale boosting'' algorithm that works by constructing a branching program over weak classifiers and has a simple analysis based on elementary properties of random walks. LS05short showed that this martingale booster can tolerate random classification noise when it is run with a noise-tolerant weak learner; however, a drawback of the algorithm is that it is not adaptive, i.e. it cannot effectively take advantage of variation in the quality of the weak classifiers it receives. In this paper we present a variant of the original martingale boosting algorithm and prove that it is adaptive. This adaptiveness is achieved by modifying the original algorithm so that the random walks that arise in its analysis have different step size depending on the quality of the weak learner at each stage. The new algorithm inherits the desirable properties of the original LS05short algorithm, such as random classification noise tolerance, and has several other advantages besides adaptiveness: it requires polynomially fewer calls to the weak learner than the original algorithm, and it can be used with confidence-rated weak hypotheses that output real values rather than Boolean predictions.

Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT Machine Learning

We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. A crucial intermediate step is proving a non-asymptotic martingale central limit theorem (CLT), i.e., establishing the rates of convergence of a multivariate martingale difference sequence to a normal random vector, which might be of independent interest. We obtain the explicit rates for the multivariate martingale CLT using a combination of Stein's method and Lindeberg's argument, which is then used in conjunction with a non-asymptotic analysis of averaged SGD proposed in [PJ92]. Our results have potentially interesting consequences for computing confidence intervals for parameter estimation with SGD and constructing hypothesis tests with SGD that are valid in a non-asymptotic sense.

PAC-Bayesian Inequalities for Martingales Machine Learning

We present a set of high-probability inequalities that control the concentration of weighted averages of multiple (possibly uncountably many) simultaneously evolving and interdependent martingales. Our results extend the PAC-Bayesian analysis in learning theory from the i.i.d. setting to martingales opening the way for its application to importance weighted sampling, reinforcement learning, and other interactive learning domains, as well as many other domains in probability theory and statistics, where martingales are encountered. We also present a comparison inequality that bounds the expectation of a convex function of a martingale difference sequence shifted to the [0,1] interval by the expectation of the same function of independent Bernoulli variables. This inequality is applied to derive a tighter analog of Hoeffding-Azuma's inequality.

Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals Machine Learning

This paper presents new deviation inequalities that are valid uniformly in time under adaptive sampling in a multi-armed bandit model. The deviations are measured using the Kullback-Leibler divergence in a given one-dimensional exponential family, and may take into account several arms at a time. They are obtained by constructing for each arm a mixture martingale based on a hierarchical prior, and by multiplying those martingales. Our deviation inequalities allow us to analyze stopping rules based on generalized likelihood ratios for a large class of sequential identification problems. We establish asymptotic optimality of sequential tests generalising the track-and-stop method to problems beyond best arm identification. We further derive sharper stopping thresholds, where the number of arms is replaced by the newly introduced pure exploration problem rank. We construct tight confidence intervals for linear functions and minima/maxima of the vector of arm means.

Power and limitations of conformal martingales Machine Learning

A standard assumption in machine learning has been the assumption that the data are generated in the IID fashion, i.e., independently from the same distribution. This assumption is also known as the assumption of randomness (see, e.g., [11, Section 7.1] and [27]). In this paper we are interested in testing this assumption. Conformal martingales are constructed on top of conventional machinelearning algorithms and have been used as a means of detecting deviations from randomness both in theoretical work (see, e.g., [27, Section 7.1], [4], [3]) and in practice (in the framework of the Microsoft Azure module on time series anomaly detection [28]). They provide an online measure of the amount of evidence found against the hypothesis of randomness and can be said to perform conformal change detection: if the assumption of randomness is satisfied, a fixed nonnegative conformal martingale with a positive initial value is not expected to increase its initial value manyfold; on the other hand, if the hypothesis of randomness is violated, a properly designed nonnegative conformal martingale with a positive initial value can be expected to increase its value substantially. Correspondingly, we have two desiderata for such a martingale S: - Validity is satisfied automatically: S is not expected to ever increase its initial value by much, under the hypothesis of randomness.