Active Learning for Regression by Inverse Distance Weighting
Active learning (AL) strategies are used in supervised learning to let the training algorithm "ask questions" [34], i.e., choose the feature vectors whose target values are queried during the training phase, usually based on the model learned so far. The main aim of AL is to reduce the number of training samples required to train the model or, in other words, to obtain a model of the same prediction quality from a smaller dataset. This is particularly useful when obtaining the target value associated with a given combination of features is expensive: for example, it may involve asking a human to "label" samples manually, running a costly and time-consuming laboratory experiment, or performing a complex computer simulation. AL methods are usually categorized into query synthesis (or population-based) methods, in which the feature vector to query can be chosen arbitrarily; pool-based sampling methods, in which the vector can only be chosen within a given finite set (or "pool") of unlabeled vectors; and selective-sampling methods, in which vectors arrive as a stream and the AL algorithm can only decide online whether to ask for the corresponding target [34]. Several approaches to AL are available in the literature; see, e.g., the survey papers [1,16,22,34,39]. Most of the literature focuses on classification problems [1,33], although AL has also been investigated for regression [9-13,25,27,38,41,42].
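To make the pool-based setting concrete, the following is a minimal Python sketch of a query loop over a finite pool. It is an illustration only, not the method of this paper: the function name `pool_based_loop`, the `oracle` callback standing in for the expensive labeling step, and the farthest-point acquisition rule are all assumptions chosen for simplicity.

```python
import numpy as np

def pool_based_loop(pool, oracle, n_queries, seed=0):
    """Query n_queries points from a finite pool, one at a time.

    Hypothetical helper: `oracle` stands in for the expensive labeling step
    (human annotation, experiment, simulation). Assumes n_queries <= len(pool).
    """
    pool = np.asarray(pool, dtype=float)
    rng = np.random.default_rng(seed)
    labeled = [int(rng.integers(len(pool)))]   # first query chosen at random
    targets = [oracle(pool[labeled[0]])]
    while len(labeled) < n_queries:
        # Distance from every pool point to its nearest already-queried point.
        diff = pool[:, None, :] - pool[labeled][None, :, :]
        d = np.linalg.norm(diff, axis=2).min(axis=1)
        d[labeled] = -np.inf                   # never query the same point twice
        nxt = int(np.argmax(d))                # assumed rule: farthest point first
        labeled.append(nxt)
        targets.append(oracle(pool[nxt]))      # the only place labels are requested
    return labeled, np.array(targets)

# Usage with a synthetic pool and a cheap stand-in oracle.
pool = np.random.default_rng(1).normal(size=(50, 2))
labeled, y = pool_based_loop(pool, oracle=lambda x: float(x.sum()), n_queries=10)
```

In a query-synthesis setting the candidate would instead be any point of the feature space, and in selective sampling the loop would see one candidate at a time and decide whether to call `oracle` on it.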
Pool-Based Unsupervised Active Learning for Regression Using Iterative Representativeness-Diversity Maximization (iRDM)
Ziang Liu, Xue Jiang, Hanbin Luo, Weili Fang, Jiajing Liu, Dongrui Wu
Active learning (AL) selects the most beneficial unlabeled samples to label, so that a better machine learning model can be trained from the same number of labeled samples. Most existing active learning for regression (ALR) approaches are supervised, meaning that the sampling process must use some label information or an existing regression model. This paper considers completely unsupervised ALR, i.e., how to select the samples to label without knowing any true label information. We propose a novel unsupervised ALR approach, iterative representativeness-diversity maximization (iRDM), to optimally balance the representativeness and the diversity of the selected samples. Experiments on 12 datasets from various domains demonstrated its effectiveness. iRDM can be applied to both linear regression and kernel regression, and it even significantly outperforms supervised ALR when the number of labeled samples is small.
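The idea of trading off representativeness against diversity without using any labels can be illustrated with a short Python sketch. The abstract does not give the algorithm, so everything below is an assumption made for illustration: the helper name `select_rd`, the use of plain k-means to form clusters, the score (distance to the nearest other selected sample minus mean distance to the sample's own cluster), and the sweep count. It should not be read as the authors' iRDM procedure.

```python
import numpy as np

def select_rd(X, m, n_sweeps=5, seed=0):
    """Pick up to m pool indices that are both representative and diverse.

    Hypothetical sketch; no label information is used anywhere.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    # Plain k-means (Lloyd's algorithm) to partition the pool into m clusters.
    centers = X[rng.choice(n, size=m, replace=False)]
    for _ in range(20):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in range(m):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    clusters = [np.flatnonzero(assign == k) for k in range(m)]
    clusters = [c for c in clusters if len(c) > 0]   # drop empty clusters

    D = np.linalg.norm(X[:, None] - X[None], axis=2)  # all pairwise distances
    # Representativeness term: mean distance to own cluster (smaller is better).
    R = [D[np.ix_(c, c)].mean(axis=1) for c in clusters]
    # Initialization: the most representative sample of each cluster.
    selected = [int(c[r.argmin()]) for c, r in zip(clusters, R)]

    for _ in range(n_sweeps):
        changed = False
        for k, (c, r) in enumerate(zip(clusters, R)):
            others = [s for j, s in enumerate(selected) if j != k]
            # Diversity term: distance to the nearest other selected sample.
            div = D[np.ix_(c, others)].min(axis=1) if others else np.zeros(len(c))
            best = int(c[(div - r).argmax()])         # assumed trade-off score
            if best != selected[k]:
                selected[k], changed = best, True
        if not changed:                               # stop once a sweep changes nothing
            break
    return selected

# Usage on a synthetic unlabeled pool.
X = np.random.default_rng(2).normal(size=(60, 3))
sel = select_rd(X, m=5)
```

Because each index is drawn from its own cluster, the returned indices are distinct; the samples at those indices would then be sent for labeling and used to fit a linear or kernel regression model.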