alr approach
Pool-Based Unsupervised Active Learning for Regression Using Iterative Representativeness-Diversity Maximization (iRDM)
Liu, Ziang, Jiang, Xue, Luo, Hanbin, Fang, Weili, Liu, Jiajing, Wu, Dongrui
Active learning (AL) selects the most beneficial unlabeled samples to label, and hence a better machine learning model can be trained from the same number of labeled samples. Most existing active learning for regression (ALR) approaches are supervised, which means the sampling process must use some label information, or an existing regression model. This paper considers completely unsupervised ALR, i.e., how to select the samples to label without knowing any true label information. We propose a novel unsupervised ALR approach, iterative representativeness-diversity maximization (iRDM), to optimally balance the representativeness and the diversity of the selected samples. Experiments on 12 datasets from various domains demonstrated its effectiveness. Our iRDM can be applied to both linear regression and kernel regression, and it even significantly outperforms supervised ALR when the number of labeled samples is small.
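The abstract does not spell out how iRDM scores samples, so the following is only a minimal Python sketch of the general idea: choosing samples without labels by trading representativeness against diversity. The function name `select_rep_div`, the mean-distance measure of representativeness, and the additive score are all assumptions made for illustration; this is not the authors' algorithm.

```python
import math

def select_rep_div(X, k):
    """Greedily choose k indices from the pool X without using any labels.

    Hypothetical scoring, assumed for illustration only:
      representativeness = small mean distance to the whole pool
      diversity          = large distance to the nearest selected sample
    """
    n = len(X)
    # Mean distance from each sample to every pool sample (lower = more central).
    mean_dist = [
        sum(math.dist(X[i], X[j]) for j in range(n)) / n for i in range(n)
    ]
    # First pick: the most central sample in the pool.
    selected = [min(range(n), key=lambda i: mean_dist[i])]
    while len(selected) < k:
        def score(i):
            # Distance to the closest already selected sample (diversity term).
            nearest = min(math.dist(X[i], X[s]) for s in selected)
            return nearest - mean_dist[i]
        candidates = [i for i in range(n) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy 1-D pool: the central point is chosen first, then the extremes.
pool = [[0.0], [1.0], [2.0], [3.0], [4.0]]
picked = select_rep_div(pool, 3)
```

The selected indices would then be sent for labeling, and any regression model could be trained on the result. An equal weighting of the two terms is an arbitrary choice here; a real method would need a principled balance.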
Integrating Informativeness, Representativeness and Diversity in Pool-Based Sequential Active Learning for Regression
In many real-world machine learning applications, unlabeled samples are easy to obtain, but it is expensive and/or time-consuming to label them. Active learning is a common approach for reducing this data labeling effort. It optimally selects the best few samples to label, so that a better machine learning model can be trained from the same number of labeled samples. This paper considers active learning for regression (ALR) problems. Three essential criteria -- informativeness, representativeness, and diversity -- have been proposed for ALR. However, very few approaches in the literature have considered all three of them simultaneously. We propose three new ALR approaches, with different strategies for integrating the three criteria. Extensive experiments on 12 datasets in various domains demonstrated their effectiveness.
Unsupervised Pool-Based Active Learning for Linear Regression
In many real-world machine learning applications, unlabeled data can be easily obtained, but labeling them is very time-consuming and/or expensive. It is therefore desirable to select the optimal samples to label, so that a good machine learning model can be trained from a minimum amount of labeled data. Active learning (AL) has been widely used for this purpose. However, most existing AL approaches are supervised: they train an initial model from a small number of labeled samples, query new samples based on the model, and then update the model iteratively. Few of them have considered the completely unsupervised AL problem, i.e., starting from zero, how to optimally select the very first few samples to label, without knowing any label information at all. This problem is very challenging, as no label information can be utilized. This paper studies unsupervised pool-based AL for linear regression problems. We propose a novel AL approach that simultaneously considers informativeness, representativeness, and diversity, three essential criteria in AL. Extensive experiments on 14 datasets from various application domains, using three different linear regression models (ridge regression, LASSO, and linear support vector regression), demonstrated the effectiveness of our proposed approach.
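The supervised loop that this abstract contrasts against (train an initial model, query based on the model, update, repeat) can be sketched generically in Python. The names `supervised_al_loop`, `oracle`, `fit`, and `query` are hypothetical placeholders supplied by the caller; nothing here is taken from the paper's implementation.

```python
def supervised_al_loop(pool, initial_idx, oracle, fit, query, budget):
    """Generic supervised pool-based active learning loop (illustrative).

    pool        : list of unlabeled samples
    initial_idx : indices labeled up front to train the first model
    oracle      : callable sample -> label (e.g. a human annotator)
    fit         : callable (samples, labels) -> model
    query       : callable (model, pool, unlabeled_indices) -> chosen index
    budget      : maximum total number of labeled samples
    """
    # Label the initial samples and train the first model.
    labeled = {i: oracle(pool[i]) for i in initial_idx}
    model = fit([pool[i] for i in labeled], list(labeled.values()))
    while len(labeled) < budget:
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # The query depends on the current model, hence "supervised".
        nxt = query(model, pool, unlabeled)
        labeled[nxt] = oracle(pool[nxt])
        # Retrain on all labels collected so far.
        model = fit([pool[i] for i in labeled], list(labeled.values()))
    return model, labeled

# Toy usage: the "model" is just the mean label; the query picks the largest x.
toy_model, toy_labeled = supervised_al_loop(
    pool=[0.0, 1.0, 2.0, 3.0],
    initial_idx=[0],
    oracle=lambda x: 2 * x,
    fit=lambda xs, ys: sum(ys) / len(ys),
    query=lambda model, pool, unl: max(unl, key=lambda i: pool[i]),
    budget=3,
)
```

The unsupervised setting studied in the paper concerns the step this loop takes for granted: how `initial_idx` is chosen when no labels and no model exist yet.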
Active Stacking for Heart Rate Estimation
Wu, Dongrui, Liu, Feifei, Liu, Chengyu
Heart rate estimation from electrocardiogram signals is very important for the early detection of cardiovascular diseases. However, due to large individual differences and varying electrocardiogram signal quality, there does not exist a single reliable estimation algorithm that works well on all subjects. Every algorithm may break down on certain subjects, resulting in a significant estimation error. Ensemble regression, which aggregates the outputs of multiple base estimators for more reliable and stable estimates, can be used to remedy this problem. Moreover, active learning can be used to optimally select a few trials from a new subject to label, based on which a stacking ensemble regression model can be trained to aggregate the base estimators. This paper proposes four active stacking approaches, and demonstrates that they all significantly outperform three common unsupervised ensemble regression approaches, and a supervised stacking approach which randomly selects some trials to label. Remarkably, our active stacking approaches only need three or four labeled trials from each subject to achieve an average root mean squared estimation error below three beats per minute, making them very convenient for real-world applications. To our knowledge, this is the first research on active stacking, and its application to heart rate estimation.
Active Learning for Regression Using Greedy Sampling
Wu, Dongrui, Lin, Chin-Teng, Huang, Jian
Regression problems are pervasive in real-world applications. Generally, a substantial number of labeled samples is needed to build a regression model with good generalization ability. However, it is often relatively easy to collect a large number of unlabeled samples, but time-consuming or expensive to label them. Active learning for regression (ALR) is a methodology to reduce the number of labeled samples, by selecting the most beneficial ones to label, instead of random selection. This paper proposes two new ALR approaches based on greedy sampling (GS). The first approach (GSy) selects new samples to increase the diversity in the output space, and the second (iGS) selects new samples to increase the diversity in both input and output spaces. Extensive experiments on 12 UCI and CMU StatLib datasets from various domains, and on 15 subjects on EEG-based driver drowsiness estimation, verified their effectiveness and robustness.
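A greedy selection step that rewards diversity in both the input and the output space, as described for iGS above, might look like the following Python sketch. Combining the two distances by a product, and the function name `igs_select`, are assumptions for illustration; the abstract does not state the exact combination rule. Setting the input distance to 1 would give an output-only variant in the spirit of GSy.

```python
import math

def igs_select(X_pool, labeled_idx, y_labeled, y_pred):
    """Pick the next pool index to label (illustrative greedy rule).

    X_pool      : list of feature vectors (lists of floats)
    labeled_idx : indices already labeled (must be non-empty)
    y_labeled   : dict mapping each labeled index to its true label
    y_pred      : current model predictions for every pool sample
    """
    labeled = set(labeled_idx)
    best_idx, best_score = None, -1.0
    for n in range(len(X_pool)):
        if n in labeled:
            continue
        # Smallest (input distance * output distance) to any labeled sample.
        # The product is an assumed way of combining the two spaces.
        score = min(
            math.dist(X_pool[n], X_pool[m]) * abs(y_pred[n] - y_labeled[m])
            for m in labeled_idx
        )
        # Greedy step: keep the candidate farthest from all labeled samples.
        if score > best_score:
            best_idx, best_score = n, score
    return best_idx

# Toy usage with a 1-D pool and predictions equal to the inputs.
X = [[0.0], [1.0], [2.0], [10.0]]
preds = [0.0, 1.0, 2.0, 10.0]
first = igs_select(X, [0], {0: 0.0}, preds)
second = igs_select(X, [0, 3], {0: 0.0, 3: 10.0}, preds)
```

Because the output distance relies on model predictions, this step needs at least one labeled sample and a trained model, which is what makes the approach supervised.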
Affect Estimation in 3D Space Using Multi-Task Active Learning for Regression
Acquisition of labeled training samples for affective computing is usually costly and time-consuming, as affects are intrinsically subjective, subtle and uncertain, and hence multiple human assessors are needed to evaluate each affective sample. Particularly, for affect estimation in the 3D space of valence, arousal and dominance, each assessor has to perform the evaluations in three dimensions, which makes the labeling problem even more challenging. Many sophisticated machine learning approaches have been proposed to reduce the data labeling requirement in various other domains, but so far few have considered affective computing. This paper proposes two multi-task active learning for regression approaches, which select the most beneficial samples to label, by considering the three affect primitives simultaneously. Experimental results on the VAM corpus demonstrated that our optimal sample selection approaches can result in better estimation performance than random selection and several traditional single-task active learning approaches. Thus, they can help alleviate the data labeling problem in affective computing, i.e., better estimation performance can be obtained from fewer labeling queries.
Pool-Based Sequential Active Learning for Regression
Active learning (AL) [33], a subfield of machine learning, considers the following problem: if the learning algorithm can choose the training data, then which training samples should it choose to maximize the learning performance, under a fixed budget, e.g., the maximum number of labeled training samples? As an example, consider emotion estimation in affective computing [28]. Emotions can be represented as continuous numbers in the 2D space of arousal and valence [30], or in the 3D space of arousal, valence, and dominance [26]. However, emotions are very subjective, subtle, and uncertain. So, usually multiple human assessors are needed to obtain the ground-truth emotion values for each affective sample (video, audio, image, physiological signal, etc.). For example, 14 to 16 assessors were used to evaluate each video clip in the DEAP dataset [21], six to 17 assessors for each utterance in the VAM (Vera am Mittag in German, Vera at Noon in English) spontaneous speech corpus [16], and at least 110 assessors for each sound in the IADS-2 (International Affective Digitized Sounds 2nd Edition) dataset [4]. This is very time-consuming and labor-intensive. How should we optimally select the affective samples to label so that an accurate regression model can be built with the minimum cost (i.e., the minimum number of labeled samples)?