Convergence Rates of Active Learning for Maximum Likelihood Estimation
Chaudhuri, Kamalika, Kakade, Sham, Netrapalli, Praneeth, Sanghavi, Sujay
In active learning, we are given a sample space X, a label space Y, a class of models that map X to Y, and a large set U of unlabelled samples. The goal of the learner is to learn a model in the class with small target error while interactively querying the labels of as few of the unlabelled samples as possible. Most theoretical work on active learning has focussed on the PAC or the agnostic PAC model, where the goal is to learn binary classifiers that belong to a particular hypothesis class [2, 13, 9, 6, 3, 4,22], andtherehasbeenonlyahandful ofexceptions[19, 8,20]. Inthispaper, weshift ourattention to a more general setting - maximum likelihood estimation (MLE), where Pr(Y X) is described by a model θ belonging to a model class Θ. We show that when data is generated by a model in this class, we can do active learning provided the model class Θ has the following simple property: the Fisher information matrix for any model θ Θ at any (x,y) depends only on x and θ. This condition is satisfied in a number of widely applicable model classes, such as Linear Regression and Generalized Linear Models (GLMs), which in turn includes models for Multiclass Classification and Conditional 1 Random Fields. Consequently, we can provide active learning algorithms for maximum likelihood estimation in all these model classes. The standard solution to active MLE estimation in the statistics literature is to select samples for label query by optimizing a class of summary statistics of the asymptotic covariance matrix of the estimator [5]. The literature, however, does not provide any guidance towards which summary statistic should be used, or any analysis of the solution quality when a finite number of labels or samples are available.
Jun-8-2015