
### Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT

We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. A crucial intermediate step is proving a non-asymptotic martingale central limit theorem (CLT), i.e., establishing the rates of convergence of a multivariate martingale difference sequence to a normal random vector, which might be of independent interest. We obtain the explicit rates for the multivariate martingale CLT using a combination of Stein's method and Lindeberg's argument; these rates are then used in conjunction with a non-asymptotic analysis of averaged SGD proposed in [PJ92]. Our results have potentially interesting consequences for computing confidence intervals for parameter estimation with SGD and constructing hypothesis tests with SGD that are valid in a non-asymptotic sense.
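For readers unfamiliar with the estimator, the following is a minimal sketch of Polyak-Ruppert averaging, not the analysis of the paper. It assumes a one-dimensional noisy quadratic objective, a step size of the form `eta0 * t**(-alpha)` with `alpha` between 1/2 and 1, and hypothetical names (`averaged_sgd`, `noisy_quadratic_grad`) chosen only for illustration.

```python
import random


def averaged_sgd(grad, theta0, n_steps, eta0=0.5, alpha=0.75, seed=0):
    """Run SGD and return (last iterate, Polyak-Ruppert average of the iterates).

    Assumption: polynomially decaying step size eta0 * t**(-alpha).
    """
    rng = random.Random(seed)
    theta = theta0
    avg = 0.0
    for t in range(1, n_steps + 1):
        step = eta0 * t ** (-alpha)
        theta = theta - step * grad(theta, rng)
        # Running mean of the iterates theta_1, ..., theta_t.
        avg += (theta - avg) / t
    return theta, avg


def noisy_quadratic_grad(theta, rng, target=2.0, noise_sd=1.0):
    """Stochastic gradient of 0.5 * (theta - target)**2 with additive Gaussian noise."""
    return (theta - target) + rng.gauss(0.0, noise_sd)


last, avg = averaged_sgd(noisy_quadratic_grad, theta0=0.0, n_steps=20000)
print("last iterate:", last)
print("averaged iterate:", avg)
```

In this toy setting the averaged iterate fluctuates around the minimizer `target` much less than the last iterate does. The normal approximation discussed in the abstract concerns the distribution of that averaged iterate.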

### Retrain or not retrain: Conformal test martingales for change-point detection

The standard assumption in mainstream machine learning is that the observed data are IID (independent and identically distributed); we will refer to it as the IID assumption. Deviations from the IID assumption are known as dataset shift, and different kinds of dataset shift have become a popular topic of research (see, e.g., Quiñonero-Candela et al. (2009)). Testing the IID assumption has been a popular topic in statistics (see, e.g., Lehmann (2006), Chapter 7), but the mainstream work in statistics concentrates on the batch setting with each observation being a real number. In the context of deciding whether a prediction algorithm needs to be retrained, it is more important to process data online, so that at each point in time we have an idea of the degree to which the IID assumption has been discredited. It is also important that the observations are not just real numbers; in the context of machine learning the most important case is where each observation is a pair (x, y) consisting of a sample x (such as an image) and its label y. The existing work on detecting dataset shift in machine learning (see, e.g., Harel et al. (2014) and its literature review) does not have these shortcomings but does not test the IID assumption directly.

### Testing for concept shift online

The most standard way of testing statistical hypotheses is batch testing: we try to reject a given null hypothesis based on a batch of data. The alternative approach of online testing (see, e.g., [10] or [9]) consists in constructing a nonnegative process that is a martingale under the null hypothesis. The ratio of the current value of such a process to its initial value can be interpreted as the amount of evidence found against the null hypothesis. The standard assumption in machine learning is the (general) IID assumption, sometimes referred to (especially in older literature) as the assumption of randomness: the observations are assumed to be independent and identically distributed, but nothing is assumed about the probability measure generating a single observation. Interestingly, there exist processes, exchangeability martingales, that are martingales under the IID assumption; they can be constructed (see, e.g., [14, Section 7.1] or [13]) using the method of conformal prediction [14, Chapter 2]. Deviations from the IID assumption have become a popular topic of research in machine learning under the name of dataset shift [6, 7]; in my terminology I will follow mostly [6].
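The construction can be illustrated with a small sketch, under simplifying assumptions that are not taken from the paper: observations are real numbers, the nonconformity score of an observation is the observation itself, and the bets use the power betting function `epsilon * p**(epsilon - 1)`. The function names are hypothetical.

```python
import math
import random


def conformal_p_values(scores, rng):
    """Smoothed conformal p-values, computed online from nonconformity scores."""
    p_values = []
    for n in range(1, len(scores) + 1):
        current = scores[n - 1]
        seen = scores[:n]
        greater = sum(1 for s in seen if s > current)
        equal = sum(1 for s in seen if s == current)  # counts the current score too
        tau = 1.0 - rng.random()  # uniform on (0, 1], so the p-value is never 0
        p_values.append((greater + tau * equal) / n)
    return p_values


def log_power_martingale(p_values, epsilon=0.5):
    """Log of the product of bets f(p) = epsilon * p**(epsilon - 1).

    f integrates to 1 over [0, 1], so the product is a nonnegative martingale
    when the p-values are independent and uniform, as they are under IID data.
    """
    log_s = 0.0
    path = []
    for p in p_values:
        log_s += math.log(epsilon) + (epsilon - 1.0) * math.log(p)
        path.append(log_s)
    return path


rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(400)]
# Assumed toy change point: the mean jumps from 0 to 4 after 200 observations.
shifted = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(4.0, 1.0) for _ in range(200)]

log_iid = log_power_martingale(conformal_p_values(iid, rng))[-1]
log_shift = log_power_martingale(conformal_p_values(shifted, rng))[-1]
print("final log-martingale, IID data:", log_iid)
print("final log-martingale, shifted data:", log_shift)
```

A large final value on the shifted sequence is evidence against the IID assumption, in the sense of the ratio of current to initial value described above. Working with logarithms avoids numerical overflow on long sequences.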

### On Equivalence of Martingale Tail Bounds and Deterministic Regret Inequalities

We study an equivalence of (i) deterministic pathwise statements appearing in the online learning literature (termed *regret bounds*), (ii) high-probability tail bounds for the supremum of a collection of martingales (of a specific form arising from uniform laws of large numbers for martingales), and (iii) in-expectation bounds for the supremum. By virtue of the equivalence, we prove exponential tail bounds for norms of Banach space valued martingales via deterministic regret bounds for the online mirror descent algorithm with an adaptive step size. We extend these results beyond the linear structure of the Banach space: we define a notion of *martingale type* for general classes of real-valued functions and show its equivalence (up to a logarithmic factor) to various sequential complexities of the class (in particular, the sequential Rademacher complexity and its offset version). For classes with the general martingale type 2, we exhibit a finer notion of variation that allows partial adaptation to the function indexing the martingale. Our proof technique rests on sequential symmetrization and on certifying the *existence* of regret minimization strategies for certain online prediction problems.

### Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals

This paper presents new deviation inequalities that are valid uniformly in time under adaptive sampling in a multi-armed bandit model. The deviations are measured using the Kullback-Leibler divergence in a given one-dimensional exponential family, and may take into account several arms at a time. They are obtained by constructing for each arm a mixture martingale based on a hierarchical prior, and by multiplying those martingales. Our deviation inequalities allow us to analyze stopping rules based on generalized likelihood ratios for a large class of sequential identification problems. We establish asymptotic optimality of sequential tests generalizing the track-and-stop method to problems beyond best arm identification. We further derive sharper stopping thresholds, where the number of arms is replaced by the newly introduced pure exploration problem rank. We construct tight confidence intervals for linear functions and minima/maxima of the vector of arm means.
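As a rough illustration of the mixture-martingale idea, the sketch below uses the classical Gaussian mixture for a single arm. This is a deliberately simpler construction than the hierarchical-prior, Kullback-Leibler-based one in the paper. It assumes unit-variance Gaussian observations, a null hypothesis of zero mean, and a Gaussian prior with variance `rho2` over the tilting parameter; all names are hypothetical.

```python
import math
import random


def log_gaussian_mixture_martingale(sum_x, t, rho2=1.0):
    """Log of the integral of exp(lam * S_t - lam**2 * t / 2) over lam ~ N(0, rho2).

    Closed form: rho2 * S_t**2 / (2 * (1 + rho2 * t)) - 0.5 * log(1 + rho2 * t).
    """
    return rho2 * sum_x ** 2 / (2.0 * (1.0 + rho2 * t)) - 0.5 * math.log(1.0 + rho2 * t)


def first_crossing(samples, delta=0.05, rho2=1.0):
    """First time the mixture martingale reaches 1 / delta, or None if it never does.

    By Ville's inequality, under the null this happens with probability at most delta.
    """
    threshold = math.log(1.0 / delta)
    running_sum = 0.0
    for t, x in enumerate(samples, start=1):
        running_sum += x
        if log_gaussian_mixture_martingale(running_sum, t, rho2) >= threshold:
            return t
    return None


rng = random.Random(0)
# Assumed toy alternative: the true mean is 1 rather than 0.
samples = [rng.gauss(1.0, 1.0) for _ in range(500)]
stop_time = first_crossing(samples)
print("stopped at t =", stop_time)
```

Because the threshold holds uniformly in time, the test may be stopped at any data-dependent moment. Multiplying per-arm martingales, as the abstract describes, corresponds to summing their log values.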