test martingale
Testing For Distribution Shifts with Conditional Conformal Test Martingales
Shaer, Shalev, Bar, Yarin, Prinster, Drew, Romano, Yaniv
We propose a sequential test for detecting arbitrary distribution shifts that allows conformal test martingales (CTMs) to work under a fixed, reference-conditional setting. Existing CTM detectors construct test martingales by continually growing a reference set with each incoming sample, using it to assess how atypical the new sample is relative to past observations. While this design yields anytime-valid type-I error control, it suffers from test-time contamination: after a change, post-shift observations enter the reference set and dilute the evidence for distribution shift, increasing detection delay and reducing power. In contrast, our method avoids contamination by design by comparing each new sample to a fixed null reference dataset. Our main technical contribution is a robust martingale construction that remains valid conditional on the null reference data, achieved by explicitly accounting for the estimation error in the reference distribution induced by the finite reference set. This yields anytime-valid type-I error control together with guarantees of asymptotic power one and bounded expected detection delay. Empirically, our method detects shifts faster than standard CTMs, providing a powerful and reliable distribution-shift detector.
- Asia > Middle East > Israel (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- Asia > Middle East > Jordan (0.04)
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
Prinster, Drew, Han, Xing, Liu, Anqi, Saria, Suchi
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but also continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Methods for nonparametric sequential testing -- especially conformal test martingales (CTMs) and anytime-valid inference -- offer promising tools for this monitoring task. However, existing approaches are restricted to monitoring limited hypothesis classes or ``alarm criteria'' (e.g., detecting data shifts that violate certain exchangeability or IID assumptions), do not allow for online adaptation in response to shifts, and/or cannot diagnose the cause of degradation or alarm. In this paper, we address these limitations by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected changepoints in the data distribution while controlling false-alarms. For practical applications, we propose specific WCTM algorithms that adapt online to mild covariate shifts (in the marginal input distribution), quickly detect harmful shifts, and diagnose those harmful shifts as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- North America > Canada (0.04)
- (2 more...)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.31)
Combining Evidence Across Filtrations
Choe, Yo Joong, Ramdas, Aaditya
In anytime-valid sequential inference, it is known that any admissible inference procedure must be based on test martingales and their composite generalization, called e-processes, which are nonnegative processes whose expectation at any arbitrary stopping time is upper-bounded by one. An e-process quantifies the accumulated evidence against a composite null hypothesis over a sequence of outcomes. This paper studies methods for combining e-processes that are computed using different information sets, i.e., filtrations, for a null hypothesis. Even though e-processes constructed on the same filtration can be combined effortlessly (e.g., by averaging), e-processes constructed on different filtrations cannot be combined as easily because their validity in a coarser filtration does not translate to validity in a finer filtration. We discuss three concrete examples of such e-processes in the literature: exchangeability tests, independence tests, and tests for evaluating and comparing forecasts with lags. Our main result establishes that these e-processes can be lifted into any finer filtration using adjusters, which are functions that allow betting on the running maximum of the accumulated wealth (thereby insuring against the loss of evidence). We also develop randomized adjusters that can improve the power of the resulting sequential inference procedure.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Safe Testing
Grünwald, Peter, de Heide, Rianne, Koolen, Wouter
We develop the theory of hypothesis testing based on the e-value, a notion of evidence that, unlike the p-value, allows for effortlessly combining results from several studies in the common scenario where the decision to perform a new study may depend on previous outcomes. Tests based on e-values are safe, i.e. they preserve Type-I error guarantees, under such optional continuation. We define growth-rate optimality (GRO) as an analogue of power in an optional continuation context, and we show how to construct GRO e-variables for general testing problems with composite null and alternative, emphasizing models with nuisance parameters. GRO e-values take the form of Bayes factors with special priors. We illustrate the theory using several classic examples including a one-sample safe t-test and the 2 x 2 contingency table. Sharing Fisherian, Neymanian and Jeffreys-Bayesian interpretations, e-values may provide a methodology acceptable to adherents of all three schools.
- North America > United States > New York (0.04)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (5 more...)
Retrain or not retrain: Conformal test martingales for change-point detection
Vovk, Vladimir, Petej, Ivan, Nouretdinov, Ilia, Ahlberg, Ernst, Carlsson, Lars, Gammerman, Alex
The standard assumption in mainstream machine learning is that the observed data are IID (independent and identically distributed); we will refer to it as the IID assumption. Deviations from the IID assumption are known as dataset shift, and different kinds of dataset shift have become a popular topic of research (see, e.g., Quiñonero-Candela et al. (2009)). Testing the IID assumption has been a popular topic in statistics (see, e.g., Lehmann (2006), Chapter 7), but the mainstream work in statistics concentrates on the batch setting with each observation being a real number. In the context of deciding whether a prediction algorithm needs to be retrained, it is more important to process data online, so that at each point in time we have an idea of the degree to which the IID assumption has been discredited. It is also important that the observations are not just real numbers; in the context of machine learning the most important case is where each observation is a pair (x, y) consisting of a sample x (such as an image) and its label y. The existing work on detecting dataset shift in machine learning (see, e.g., Harel et al. (2014) and its literature review) does not have these shortcomings but does not test the IID assumption directly.
- North America > United States > New York (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Workflow (0.68)
- Research Report (0.65)
Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals
Kaufmann, Emilie, Koolen, Wouter
This paper presents new deviation inequalities that are valid uniformly in time under adaptive sampling in a multi-armed bandit model. The deviations are measured using the Kullback-Leibler divergence in a given one-dimensional exponential family, and may take into account several arms at a time. They are obtained by constructing for each arm a mixture martingale based on a hierarchical prior, and by multiplying those martingales. Our deviation inequalities allow us to analyze stopping rules based on generalized likelihood ratios for a large class of sequential identification problems. We establish asymptotic optimality of sequential tests generalising the track-and-stop method to problems beyond best arm identification. We further derive sharper stopping thresholds, where the number of arms is replaced by the newly introduced pure exploration problem rank. We construct tight confidence intervals for linear functions and minima/maxima of the vector of arm means.
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)