Optimization
Optimizing Ensemble Weights and Hyperparameters of Machine Learning Models for Regression Problems
Shahhosseini, Mohsen, Hu, Guiping, Pham, Hieu
Aggregating multiple learners through an ensemble of models aims to make better predictions by capturing the underlying distribution more accurately. Different ensembling methods, such as bagging, boosting and stacking/blending, have been studied and adopted extensively in research and practice. While bagging and boosting intend to reduce variance and bias, respectively, blending approaches target both by finding the optimal way to combine base learners to find the best trade-off between bias and variance. In blending, ensembles are created from weighted averages of multiple base learners. In this study, a systematic approach is proposed to find the optimal weights to create these ensembles for bias-variance tradeoff using cross-validation for regression problems (Cross-validated Optimal Weighted Ensemble (COWE)). Furthermore, it is known that tuning hyperparameters of each base learner inside the ensemble weight optimization process can produce better performing ensembles. To this end, a nested algorithm based on bi-level optimization that considers tuning hyperparameters as well as finding the optimal weights to combine ensembles (Cross-validated Optimal Weighted Ensemble with Internally Tuned Hyperparameters (COWE-ITH)) was proposed. The algorithm is shown to be generalizable to real data sets though analyses with ten publicly available data sets. The prediction accuracies of COWE-ITH and COWE have been compared to base learners and the state-of-art ensemble methods. The results show that COWE-ITH outperforms other benchmarks as well as base learners in 9 out of 10 data sets.
Inside the Mind and Methodology of a Data Scientist - Birst
When you hear about Data Science, Big Data, Analytics, Artificial Intelligence, Machine Learning, or Deep Learning, you may end up feeling a bit confused about what these terms mean. And it doesn't help reduce the confusion when every tech vendor rebrands their products as AI. So, what do these terms really mean? What are overlaps and differences? And most importantly, what can this do for your business?
Towards Assessing the Impact of Bayesian Optimization's Own Hyperparameters
Lindauer, Marius, Feurer, Matthias, Eggensperger, Katharina, Biedenkapp, Andrรฉ, Hutter, Frank
Treating the validation loss of trained machine learning models as a black box function f, we can formulate the hyperparameter optimization problem as: x arg min x X f (x) (1) where X is space of possible configurations x . Although the community is aware of the necessity of hy-perparameter optimization (HPO) for machine learning algorithms, the impact of BO's own hyperparameters is not reported in most BO papers. On top of this, new BO approaches (and implicitly their hyperparameters) are often developed on cheap-to-evaluate artificial functions and then evaluated on real benchmarks. Although we acknowledge that this is a reasonable protocol to prevent over-engineering on the target Contact Author function family (here for example HPO benchmarks of machine learning algorithms), we believe that it is important to study whether this practice is indeed well-founded. We emphasize that this paper considers HPO on two levels as shown in Figure 1: (i) HPO of machine learning algorithms, which we consider as our target function (Target BO) and (ii) optimization of the target-BO's own choices using a meta-optimizer. In particular, we study several research questions related to the meta-optimization problem of BO's hyperparameters: 1. How large is the impact of tuning BO's own hyperpa-rameters if one was allowed to tune these on each function independently? 2. How well does the performance of an optimized configuration of the target-BO generalize to similar new functions from the same family?
Computing Estimators of Dantzig Selector type via Column and Constraint Generation
Mazumder, Rahul, Wright, Stephen, Zheng, Andrew
We consider a class of linear-programming based estimators in reconstructing a sparse signal from linear measurements. Specific formulations of the reconstruction problem considered here include Dantzig selector, basis pursuit (for the case in which the measurements contain no errors), and the fused Dantzig selector (for the case in which the underlying signal is piecewise constant). In spite of being estimators central to sparse signal processing and machine learning, solving these linear programming problems for large scale instances remains a challenging task, thereby limiting their usage in practice. We show that classic constraint- and column-generation techniques from large scale linear programming, when used in conjunction with a commercial implementation of the simplex method, and initialized with the solution from a closely-related Lasso formulation, yields solutions with high efficiency in many settings.
Evolutionary Computation, Optimization and Learning Algorithms for Data Science
Mohammadi, Farid Ghareh, Amini, M. Hadi, Arabnia, Hamid R.
A large number of engineering, science and computational problems have yet to be solved in a computationally efficient way. One of the emerging challenges is how evolving technologies grow towards autonomy and intelligent decision making. This leads to collection of large amounts of data from various sensing and measurement technologies, e.g., cameras, smart phones, health sensors, smart electricity meters, and environment sensors. Hence, it is imperative to develop efficient algorithms for generation, analysis, classification, and illustration of data. Meanwhile, data is structured purposefully through different representations, such as large-scale networks and graphs. We focus on data science as a crucial area, specifically focusing on a curse of dimensionality (CoD) which is due to the large amount of generated/sensed/collected data. This motivates researchers to think about optimization and to apply nature-inspired algorithms, such as evolutionary algorithms (EAs) to solve optimization problems. Although these algorithms look un-deterministic, they are robust enough to reach an optimal solution. Researchers do not adopt evolutionary algorithms unless they face a problem which is suffering from placement in local optimal solution, rather than global optimal solution. In this chapter, we first develop a clear and formal definition of the CoD problem, next we focus on feature extraction techniques and categories, then we provide a general overview of meta-heuristic algorithms, its terminology, and desirable properties of evolutionary algorithms.
Federated Learning with Additional Mechanisms on Clients to Reduce Communication Costs
Yao, Xin, Huang, Tianchi, Wu, Chenglei, Zhang, Rui-xiao, Sun, Lifeng
Federated learning (FL) enables on-device training over distributed networks consisting of a massive amount of modern smart devices, such as smartphones and IoT devices. However, the leading optimization algorithm in such settings, i.e., federated averaging (FedAvg), suffers from heavy communication cost and inevitable performance drop, especially when the local data is distributed in a non-IID way. To alleviate this problem, we propose two potential solutions by introducing additional mechanisms to the on-device training. The first (FedMMD) is adopting a two-stream model with the MMD (Maximum Mean Discrepancy) constraint instead of a single model in vanilla FedAvg to be trained on devices. Experiments show that the proposed method outperforms baselines, especially in non-IID FL settings, with a reduction of more than 20% in required communication rounds. The second is FL with feature fusion (FedFusion). By aggregating the features from both the local and global models, we achieve higher accuracy at less communication cost. Furthermore, the feature fusion modules offer better initialization for newly incoming clients and thus speed up the process of convergence. Experiments in popular FL scenarios show that our FedFusion outperforms baselines in both accuracy and generalization ability while reducing the number of required communication rounds by more than 60%.
BOAH: A Tool Suite for Multi-Fidelity Bayesian Optimization & Analysis of Hyperparameters
Lindauer, Marius, Eggensperger, Katharina, Feurer, Matthias, Biedenkapp, Andrรฉ, Marben, Joshua, Mรผller, Philipp, Hutter, Frank
Hyperparameter optimization and neural architecture search can become prohibitively expensive for regular black-box Bayesian optimization because the training and evaluation of a single model can easily take several hours. To overcome this, we introduce a comprehensive tool suite for effective multi-fidelity Bayesian optimization and the analysis of its runs. The suite, written in Python, provides a simple way to specify complex design spaces, a robust and efficient combination of Bayesian optimization and HyperBand, and a comprehensive analysis of the optimization process and its outcomes.
Linear Stochastic Bandits Under Safety Constraints
Amani, Sanae, Alizadeh, Mahnoosh, Thrampoulidis, Christos
Bandit algorithms have various application in safety-critical systems, where it is important to respect the system constraints that rely on the bandit's unknown parameters at every round. In this paper, we formulate a linear stochastic multi-armed bandit problem with safety constraints that depend (linearly) on an unknown parameter vector. As such, the learner is unable to identify all safe actions and must act conservatively in ensuring that her actions satisfy the safety constraint at all rounds (at least with high probability). For these bandits, we propose a new UCB-based algorithm called Safe-LUCB, which includes necessary modifications to respect safety constraints. The algorithm has two phases. During the pure exploration phase the learner chooses her actions at random from a restricted set of safe actions with the goal of learning a good approximation of the entire unknown safe set. Once this goal is achieved, the algorithm begins a safe exploration-exploitation phase where the learner gradually expands their estimate of the set of safe actions while controlling the growth of regret. We provide a general regret bound for the algorithm, as well as a problem dependent bound that is connected to the location of the optimal action within the safe set. We then propose a modified heuristic that exploits our problem dependent analysis to improve the regret.
Safe global optimization of expensive noisy black-box functions in the $\delta$-Lipschitz framework
Sergeyev, Yaroslav D., Candelieri, Antonio, Kvasov, Dmitri E., Perego, Riccardo
In this paper, the problem of safe global maximization (it should not be confused with robust optimization) of expensive noisy black-box functions satisfying the Lipschitz condition is considered. The notion "safe" means that the objective function $f(x)$ during optimization should not violate a "safety" threshold, for instance, a certain a priori given value $h$ in a maximization problem. Thus, any new function evaluation must be performed at "safe points" only, namely, at points $y$ for which it is known that the objective function $f(y) > h$. The main difficulty here consists in the fact that the used optimization algorithm should ensure that the safety constraint will be satisfied at a point $y$ before evaluation of $f(y)$ will be executed. Thus, it is required both to determine the safe region $\Omega$ within the search domain~$D$ and to find the global maximum within $\Omega$. An additional difficulty consists in the fact that these problems should be solved in the presence of the noise. This paper starts with a theoretical study of the problem and it is shown that even though the objective function $f(x)$ satisfies the Lipschitz condition, traditional Lipschitz minorants and majorants cannot be used due to the presence of the noise. Then, a $\delta$-Lipschitz framework and two algorithms using it are proposed to solve the safe global maximization problem. The first method determines the safe area within the search domain and the second one executes the global maximization over the found safe region. For both methods a number of theoretical results related to their functioning and convergence is established. Finally, numerical experiments confirming the reliability of the proposed procedures are performed.
Multitask and Transfer Learning for Autotuning Exascale Applications
Sid-Lakhdar, Wissam M., Aznaveh, Mohsen Mahmoudi, Li, Xiaoye S., Demmel, James W.
Multitask learning and transfer learning have proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We aim at deriving methods following these paradigms for use in autotuning, where the goal is to find the optimal performance parameters of an application treated as a black-box function. We show comparative results with state-of-the-art autotuning techniques. For instance, we observe an average $1.5x$ improvement of the application runtime compared to the OpenTuner and HpBandSter autotuners. We explain how our approaches can be more suitable than some state-of-the-art autotuners for the tuning of any application in general and of expensive exascale applications in particular.