Set and functional prediction: randomness, exchangeability, and conformal prediction
Vovk, Vladimir
Conformal prediction is usually presented as a method of set prediction [10, Part I], i.e., as a way of producing prediction sets (rather than point predictions). Another way to look at a conformal predictor is as a way of producing a p-value function (discussed, in a slightly different context, in, e.g., [4]), which is a function mapping each possible label y of a test object to the corresponding conformal p-value. In analogy with "prediction sets", we will call such p-value functions "prediction functions".
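A prediction function of the kind described above can be sketched in a few lines: given a test object, it returns the conformal p-value for every candidate label. The nonconformity measure below (distance to the mean of same-labelled examples) is an illustrative choice for this sketch, not the measure used in the paper.

```python
import numpy as np

def nonconformity(x, y, X_rest, y_rest):
    # Illustrative nonconformity score: distance from x to the mean
    # of the remaining examples that carry the same label y.
    same = X_rest[y_rest == y]
    if len(same) == 0:
        return np.inf
    return np.linalg.norm(x - same.mean(axis=0))

def prediction_function(x_test, X_train, y_train, labels):
    """Map each candidate label y to its conformal p-value."""
    p = {}
    for y in labels:
        # Augment the training set with (x_test, y) and score everything.
        X_aug = np.vstack([X_train, x_test])
        y_aug = np.append(y_train, y)
        scores = [nonconformity(X_aug[i], y_aug[i],
                                np.delete(X_aug, i, axis=0),
                                np.delete(y_aug, i))
                  for i in range(len(y_aug))]
        # p-value: fraction of scores at least as large as the test score.
        p[y] = float(np.mean([s >= scores[-1] for s in scores]))
    return p
```

A label that fits the training data well receives a large p-value; implausible labels receive small ones, which is what makes the function usable for set prediction (keep all labels with p-value above the significance level).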
Randomness, exchangeability, and conformal prediction
Vovk, Vladimir
This paper continues development of the functional theory of randomness, a modification of the algorithmic theory of randomness getting rid of unspecified additive constants. It introduces new kinds of confidence predictors, including randomness predictors (the most general confidence predictors based on the assumption of IID observations) and exchangeability predictors (the most general confidence predictors based on the assumption of exchangeable observations). The main result implies that both are close to conformal predictors and quantifies the difference between randomness prediction and conformal prediction.
Logic of subjective probability
Vovk, Vladimir
In this paper I discuss both syntax and semantics of subjective probability. The semantics determines ways of testing probability statements. Among important varieties of subjective probabilities are intersubjective probabilities and impersonal probabilities, and I will argue that well-tested impersonal probabilities acquire features of objective probabilities. Jeffreys's law, my next topic, states that two successful probability forecasters must issue forecasts that are close to each other, thus supporting the idea of objective probabilities. Finally, I will discuss connections between subjective and frequentist probability.
Retrain or not retrain: Conformal test martingales for change-point detection
Vovk, Vladimir, Petej, Ivan, Nouretdinov, Ilia, Ahlberg, Ernst, Carlsson, Lars, Gammerman, Alex
The standard assumption in mainstream machine learning is that the observed data are IID (independent and identically distributed); we will refer to it as the IID assumption. Deviations from the IID assumption are known as dataset shift, and different kinds of dataset shift have become a popular topic of research (see, e.g., Quiñonero-Candela et al. (2009)). Testing the IID assumption has been a popular topic in statistics (see, e.g., Lehmann (2006), Chapter 7), but the mainstream work in statistics concentrates on the batch setting with each observation being a real number. In the context of deciding whether a prediction algorithm needs to be retrained, it is more important to process data online, so that at each point in time we have an idea of the degree to which the IID assumption has been discredited. It is also important that the observations are not just real numbers; in the context of machine learning the most important case is where each observation is a pair (x, y) consisting of a sample x (such as an image) and its label y. The existing work on detecting dataset shift in machine learning (see, e.g., Harel et al. (2014) and its literature review) does not have these shortcomings but does not test the IID assumption directly.
Testing for concept shift online
Vovk, Vladimir
The most standard way of testing statistical hypotheses is batch testing: we try to reject a given null hypothesis based on a batch of data. The alternative approach of online testing (see, e.g., [10] or [9]) consists in constructing a nonnegative process that is a martingale under the null hypothesis. The ratio of the current value of such a process to its initial value can be interpreted as the amount of evidence found against the null hypothesis. The standard assumption in machine learning is the (general) IID assumption, sometimes referred to (especially in older literature) as the assumption of randomness: the observations are assumed to be independent and identically distributed, but nothing is assumed about the probability measure generating a single observation. Interestingly, there exist processes, exchangeability martingales, that are martingales under the IID assumption; they can be constructed (see, e.g., [14, Section 7.1] or [13]) using the method of conformal prediction [14, Chapter 2]. Deviations from the IID assumption have become a popular topic of research in machine learning under the name of dataset shift [6, 7]; in my terminology I will follow mostly [6].
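The martingale construction described above can be illustrated with a minimal sketch. Under the IID assumption, conformal p-values are independent and uniform on [0, 1], so any betting function f ≥ 0 integrating to 1 over [0, 1] yields a nonnegative exchangeability martingale S_n = f(p_1)⋯f(p_n). The power betting function f(p) = ε p^(ε−1) used here is one standard choice; the value of ε is a tuning parameter assumed for this sketch.

```python
def power_martingale(p_values, eps=0.5):
    """Trajectory of the power martingale S_n = prod eps * p_i**(eps - 1).

    The ratio of the current value to the initial value (here 1) measures
    the evidence found against the null hypothesis of exchangeability.
    """
    s, path = 1.0, [1.0]
    for p in p_values:
        s *= eps * max(p, 1e-12) ** (eps - 1)  # guard against p == 0
        path.append(s)
    return path
```

A run of small p-values (the signature of a violated IID assumption) inflates the martingale, while uniform p-values keep it from growing, in line with Ville's inequality.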
Training conformal predictors
Colombo, Nicolo, Vovk, Vladimir
Efficiency criteria for conformal prediction, such as "observed fuzziness" (i.e., the sum of p-values associated with false labels), are commonly used to evaluate the performance of given conformal predictors. Here, we investigate whether it is possible to exploit efficiency criteria to learn classifiers, both conformal predictors and point classifiers, by using such criteria as training objective functions. The proposed idea is implemented for the problem of binary classification of hand-written digits. By choosing a 1-dimensional model class (with one real-valued free parameter), we can solve the optimization problems through an (approximate) exhaustive search over (a discrete version of) the parameter space. Our empirical results suggest that conformal predictors trained by minimizing their observed fuzziness perform better than conformal predictors trained in the traditional way by minimizing the prediction error of the corresponding point classifier. They also have a reasonable performance in terms of their prediction error on the test set.
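The observed-fuzziness criterion defined above is straightforward to compute once the conformal p-values are available; a minimal sketch, with the p-value tables assumed to come from some conformal predictor:

```python
def observed_fuzziness(p_value_tables, true_labels):
    """Sum, over the test set, of the p-values of the false labels.

    p_value_tables: one dict {label: p-value} per test example.
    Lower values mean a more efficient (less fuzzy) conformal predictor.
    """
    total = 0.0
    for p, y_true in zip(p_value_tables, true_labels):
        total += sum(v for label, v in p.items() if label != y_true)
    return total
```

Used as a training objective, as the abstract describes, the criterion is minimized over the free parameter(s) of the model class instead of merely being reported on a test set.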
Power and limitations of conformal martingales
Vovk, Vladimir
A standard assumption in machine learning has been the assumption that the data are generated in the IID fashion, i.e., independently from the same distribution. This assumption is also known as the assumption of randomness (see, e.g., [11, Section 7.1] and [27]). In this paper we are interested in testing this assumption. Conformal martingales are constructed on top of conventional machine-learning algorithms and have been used as a means of detecting deviations from randomness both in theoretical work (see, e.g., [27, Section 7.1], [4], [3]) and in practice (in the framework of the Microsoft Azure module on time series anomaly detection [28]). They provide an online measure of the amount of evidence found against the hypothesis of randomness and can be said to perform conformal change detection: if the assumption of randomness is satisfied, a fixed nonnegative conformal martingale with a positive initial value is not expected to increase its initial value manyfold; on the other hand, if the hypothesis of randomness is violated, a properly designed nonnegative conformal martingale with a positive initial value can be expected to increase its value substantially. Correspondingly, we have two desiderata for such a martingale S:
- Validity is satisfied automatically: S is not expected to ever increase its initial value by much, under the hypothesis of randomness.
Conformal calibrators
Vovk, Vladimir, Petej, Ivan, Toccaceli, Paolo, Gammerman, Alex
Conformal predictive distributions were inspired by the work on predictive distributions in parametric statistics (see, e.g., [7, Chapter 12] and [8]) and first suggested in [14]. As usual, we will refer to algorithms producing conformal predictive distributions as conformal predictive systems (CPS, used in both singular and plural senses). Conformal predictive systems are built on top of traditional prediction algorithms to ensure a property of validity usually referred to as calibration in probability [3]. Several versions of the Least Squares Prediction Machine, CPS based on the method of Least Squares, are constructed in [14]. This construction is slightly extended to cover ridge regression and then further extended to nonlinear settings by applying the kernel trick in [12]. However, even after this extension the method is not fully adaptive, even for a universal kernel. As explained in [12, Section 7], the universality of the kernel shows in the ability of the predictive distribution function to take any shape; however, the CPS is still inflexible in that the shape does not depend, or depends weakly, on the test object. For many base algorithms full CPS (like full conformal predictors in general) are computationally inefficient, and [13] define and study computationally efficient versions of CPS, namely split-conformal predictive systems (SCPS) and cross-conformal predictive systems (CCPS).
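The split-conformal idea mentioned above can be sketched as follows: a base algorithm is fitted on a proper training set, and the residuals on a separate calibration set turn its point prediction into a predictive distribution function. The trivial mean-value base predictor below is a placeholder for this sketch, not the Least Squares Prediction Machine of the paper.

```python
import numpy as np

def fit_scps(y_train, y_calib):
    """Minimal split-conformal predictive system with a constant base predictor."""
    y_hat = float(np.mean(y_train))                 # placeholder base algorithm
    residuals = np.sort(np.asarray(y_calib) - y_hat)
    n = len(residuals)

    def predictive_cdf(y):
        # Empirical predictive distribution for the next label: the fraction
        # of calibration residuals not exceeding y - y_hat (out of n + 1).
        return float(np.searchsorted(residuals, y - y_hat, side="right") / (n + 1))

    return predictive_cdf
```

The resulting function is a monotone, right-continuous step function; calibration in probability holds because, under the IID assumption, the rank of the test residual among the calibration residuals is uniformly distributed.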
Conformal predictive distributions with kernels
Vovk, Vladimir, Nouretdinov, Ilia, Manokhin, Valery, Gammerman, Alex
This paper reviews the checkered history of predictive distributions in statistics and discusses two developments, one from recent literature and the other new. The first development is bringing predictive distributions into machine learning, whose early development was so deeply influenced by two remarkable groups at the Institute of Automation and Remote Control. The second development is combining predictive distributions with kernel methods, which were originated by one of those groups, including Emmanuel Braverman.