Optimization
Dimension-free convergence rates for gradient Langevin dynamics in RKHS
Muzellec, Boris, Sato, Kanji, Massias, Mathurin, Suzuki, Taiji
Gradient Langevin dynamics (GLD) and stochastic GLD (SGLD) have attracted considerable attention lately, as a way to provide convergence guarantees in a non-convex setting. However, the known rates grow exponentially with the dimension of the space. In this work, we provide a convergence analysis of GLD and SGLD when the optimization space is an infinite dimensional Hilbert space. More precisely, we derive non-asymptotic, dimension-free convergence rates for GLD/SGLD when performing regularized non-convex optimization in a reproducing kernel Hilbert space. Amongst others, the convergence analysis relies on the properties of a stochastic differential equation, its discrete time Galerkin approximation and the geometric ergodicity of the associated Markov chains.
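For orientation, a GLD iteration in finite dimension is a gradient step plus scaled Gaussian noise. The sketch below is a toy one-dimensional illustration, not the infinite-dimensional RKHS setting the abstract analyzes. The double-well objective, the step size `eta`, and the inverse temperature `beta` are all assumptions chosen for the example.

```python
import numpy as np

def gld(grad, x0, eta=1e-2, beta=10.0, n_steps=5000, seed=0):
    """Gradient Langevin dynamics: x <- x - eta * grad(x) + sqrt(2 * eta / beta) * noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - eta * grad(x) + np.sqrt(2.0 * eta / beta) * noise
    return x

# Hypothetical non-convex objective f(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1.
def grad_double_well(x):
    return 4.0 * x * (x ** 2 - 1.0)

# Starting at the local maximum x = 0, the noise lets the iterate escape toward a well.
x_final = gld(grad_double_well, x0=np.array([0.0]))
```

A larger `beta` means less noise: the iteration approaches plain gradient descent, while a smaller `beta` explores more at the cost of a wider stationary spread.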
Adaptive Federated Optimization
Reddi, Sashank, Charles, Zachary, Zaheer, Manzil, Garrett, Zachary, Rush, Keith, Konečný, Jakub, Kumar, Sanjiv, McMahan, H. Brendan
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Due to the heterogeneity of the client datasets, standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
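The general pattern described above (clients train locally, the server treats the averaged model change as a pseudo-gradient and applies an adaptive update) can be sketched as follows. This is a minimal illustration with an Adam-style server step on a toy least-squares problem, not the authors' implementation. The function names, hyperparameters, and synthetic client data are all assumptions.

```python
import numpy as np

def client_update(x, data, lr=0.1, local_steps=5):
    """Run a few local gradient steps on a least-squares loss; return the model delta."""
    A, b = data
    y = x.copy()
    for _ in range(local_steps):
        grad = A.T @ (A @ y - b) / len(b)
        y -= lr * grad
    return y - x

def fed_adam(clients, dim, rounds=200, server_lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """Server loop: average the client deltas, then apply an Adam-style update."""
    x, m, v = np.zeros(dim), np.zeros(dim), np.zeros(dim)
    for _ in range(rounds):
        delta = np.mean([client_update(x, d) for d in clients], axis=0)
        m = beta1 * m + (1.0 - beta1) * delta          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * delta ** 2     # second-moment estimate
        x = x + server_lr * m / (np.sqrt(v) + tau)     # adaptive server step
    return x

# Synthetic heterogeneous clients: each has a slightly different ground-truth model.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
clients = []
for k in range(4):
    A = rng.standard_normal((20, 3))
    clients.append((A, A @ (w_true + 0.1 * k)))

def global_loss(x):
    return np.mean([0.5 * np.mean((A @ x - b) ** 2) for A, b in clients])

x_fed = fed_adam(clients, dim=3)
```

Replacing the `v` update with a running sum of `delta ** 2` would give an Adagrad-style server step instead.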
Tightly Robust Optimization via Empirical Domain Reduction
Yabe, Akihiro, Maehara, Takanori
Data-driven decision-making is performed by solving a parameterized optimization problem, and the optimal decision is given by an optimal solution for unknown true parameters. We often need a solution that satisfies true constraints even though these are unknown. Robust optimization is employed to obtain such a solution, where the uncertainty of the parameter is represented by an ellipsoid, and the scale of robustness is controlled by a coefficient. In this study, we propose an algorithm to determine the scale such that the solution has a good objective value and satisfies the true constraints with a given confidence probability. Under some regularity conditions, the scale obtained by our algorithm is asymptotically $O(1/\sqrt{n})$, whereas the scale obtained by a standard approach is $O(\sqrt{d/n})$. This means that our algorithm is less affected by the dimensionality of the parameters.
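To make the role of the scale coefficient concrete, consider the standard textbook case of a single linear constraint whose coefficient vector lies in an ellipsoid. The worst case over the ellipsoid then has a closed form. The sketch below assumes this linear form and uses hypothetical values for `a_hat`, `Sigma`, and the scale `r`; it does not implement the paper's algorithm for choosing the scale.

```python
import numpy as np

def robust_lhs(a_hat, Sigma, r, x):
    """Worst case of a @ x over the ellipsoid
    {a : (a - a_hat)^T Sigma^{-1} (a - a_hat) <= r^2},
    which equals a_hat @ x + r * sqrt(x^T Sigma x)."""
    return float(a_hat @ x + r * np.sqrt(x @ Sigma @ x))

# Hypothetical estimate, covariance, decision, and robustness scale.
a_hat = np.array([1.0, 2.0])
Sigma = np.array([[0.5, 0.1], [0.1, 0.2]])
x = np.array([1.0, 1.0])
r = 0.3

worst = robust_lhs(a_hat, Sigma, r, x)
# The decision x is robustly feasible for the constraint a @ x <= b exactly when worst <= b.
```

A smaller `r` shrinks the ellipsoid and makes the constraint less conservative, which is why the rate at which the scale decays with the sample size matters.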
Decentralized gradient methods: does topology matter?
Neglia, Giovanni, Xu, Chuan, Towsley, Don, Calbi, Gianmarco
Consensus-based distributed optimization methods have recently been advocated as alternatives to parameter server and ring all-reduce paradigms for large scale training of machine learning models. In this case, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by averaging the estimates obtained from its neighbors, and applying a correction on the basis of its local dataset. While theoretical results suggest that the worker communication topology should have a strong impact on the number of epochs needed to converge, previous experiments have pointed to the opposite conclusion. This paper sheds light on this apparent contradiction and shows how sparse topologies can lead to faster convergence even in the absence of communication delays.
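The update described here (average with neighbors, then correct with a local gradient) can be sketched on a toy problem. The example below assumes a ring topology with uniform weights and scalar quadratic local losses. The mixing matrix, step size, and targets are illustrative choices, not taken from the paper.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix for a ring of n >= 3 workers:
    each worker averages itself and its two neighbors with weight 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def decentralized_gd(targets, W, lr=0.05, n_steps=500):
    """Worker i holds the local loss 0.5 * (x - targets[i])^2.
    Each step mixes neighbor estimates, then applies the local gradient correction."""
    x = np.zeros(len(targets))
    for _ in range(n_steps):
        local_grads = x - targets
        x = W @ x - lr * local_grads
    return x

targets = np.array([1.0, 3.0, -2.0, 4.0, 0.5])
W = ring_mixing_matrix(len(targets))
x_workers = decentralized_gd(targets, W)
# The minimizer of the summed losses is the mean of the targets.
```

Because `W` is doubly stochastic, the average of the worker estimates follows plain gradient descent on the global objective. The topology only affects how far individual workers stray from that average.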
Do optimization methods in deep learning applications matter?
Ozyildirim, Buse Melis, Kiran, Mariam
With advances in deep learning, exponential data growth, and increasing model complexity, the development of efficient optimization methods is attracting much research attention. Several implementations favor Conjugate Gradient (CG) and Stochastic Gradient Descent (SGD) as practical and elegant solutions for achieving quick convergence; however, these optimization processes also present many limitations in learning across deep learning applications. Recent research is exploring higher-order optimization functions as better approaches, but these present very complex computational challenges for practical use. Comparing first- and higher-order optimization functions, our experiments reveal that Levenberg-Marquardt (LM) achieves significantly better convergence but suffers from very long processing times, increasing the training complexity of both classification and reinforcement learning problems. Our experiments compare off-the-shelf optimization functions (CG, SGD, LM, and L-BFGS) on standard CIFAR, MNIST, CartPole, and FlappyBird experiments. The paper presents arguments on which optimization functions to use and, further, which functions would benefit from parallelization efforts to improve pretraining time and learning rate convergence.
Causality and Robust Optimization
A decision-maker must consider confounding bias when attempting to apply machine learning prediction, and, while feature selection is widely recognized as an important process in data analysis, it can itself cause confounding bias. A causal Bayesian network is a standard tool for describing causal relationships, and if the relationships are known, then adjustment criteria can determine with which features confounding bias disappears. A standard modification would thus utilize causal discovery algorithms to prevent confounding bias in feature selection. Causal discovery algorithms, however, essentially rely on the faithfulness assumption, which turns out to be easily violated in practical feature selection settings. In this paper, we propose a meta-algorithm that can remedy existing feature selection algorithms in terms of confounding bias. Our algorithm is induced from a novel adjustment criterion that requires, rather than faithfulness, an assumption that can be induced from another well-known assumption, causal sufficiency. We further prove that the features added through our modification convert confounding bias into prediction variance. With the aid of existing robust optimization technologies that regularize risky strategies with high variance, we are then able to improve the throughput performance of decision-making optimization, as shown in our experimental results.
How to Evaluate Solutions in Pareto-based Search-Based Software Engineering? A Critical Review and Methodological Guidance
Li, Miqing, Chen, Tao, Yao, Xin
With modern requirements, there is an increasing tendency to consider multiple objectives/criteria simultaneously in many Software Engineering (SE) scenarios. Such a multi-objective optimization scenario comes with an important issue: how to evaluate the outcome of optimization algorithms, which typically is a set of incomparable solutions (i.e., being Pareto non-dominated to each other). This issue can be challenging for the SE community, particularly for practitioners of Search-Based SE (SBSE). On one hand, multi-objective optimization may still be relatively new to SE/SBSE researchers, who may not be able to identify the right evaluation methods for their problems. On the other hand, simply following the evaluation methods for general multi-objective optimization problems may not be appropriate for specific SE problems, especially when the problem nature or the decision maker's preferences are explicitly/implicitly available. This has been well echoed in the literature by various inappropriate/inadequate selections and inaccurate/misleading uses of evaluation methods. In this paper, we carry out a critical review of quality evaluation for multi-objective optimization in SBSE. We survey 717 papers published between 2009 and 2019 from 36 venues in 7 repositories, and select 97 prominent studies, through which we identify five important but overlooked issues in the area. We then conduct an in-depth analysis of quality evaluation indicators and general situations in SBSE, which, together with the identified issues, enables us to provide methodological guidance for selecting and using evaluation methods in different SBSE scenarios.
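The notion of a set of mutually Pareto non-dominated solutions can be made concrete with a short sketch. It assumes every objective is to be minimized and uses made-up objective vectors; it illustrates only the definition, not any evaluation indicator discussed in the review.

```python
def dominates(a, b):
    """True if a Pareto-dominates b under minimization:
    a is no worse on every objective and strictly better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(solutions):
    """Keep the solutions that no other solution dominates."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]

# Hypothetical (cost, defects) pairs for five candidate solutions.
candidates = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
front = non_dominated(candidates)
```

Here `(3, 4)` is dominated by `(2, 3)` and `(5, 5)` by `(1, 5)`, so the remaining three solutions are incomparable to each other. Choosing among them requires either a quality indicator or the decision maker's preferences.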
Graph Representation Learning for Merchant Incentive Optimization in Mobile Payment Marketing
Liu, Ziqi, Wang, Dong, Yu, Qianyu, Zhang, Zhiqiang, Shen, Yue, Ma, Jian, Zhong, Wenliang, Gu, Jinjie, Zhou, Jun, Yang, Shuang, Qi, Yuan
Mobile payment services such as Alipay have been widely used in our daily lives. To further promote mobile payment activities, it is important to run marketing campaigns under a limited budget by providing incentives such as coupons and commissions to merchants. As a result, incentive optimization is the key to maximizing the commercial objective of the marketing campaign. Through analyses of online experiments, we found that the transaction network can subtly describe the similarity of merchants' responses to different incentives, which is of great use in the incentive optimization problem. In this paper, we present a graph representation learning method atop transaction networks for merchant incentive optimization in mobile payment marketing. With limited samples collected from online experiments, our end-to-end method first learns merchant representations based on an attributed transaction network, then effectively models the correlations between the commercial objectives each merchant may achieve and the incentives under varying treatments. Thus we are able to model the sensitivity to incentives for each merchant, and spend most of the budget on those merchants that show strong sensitivity in the marketing campaign. Extensive offline and online experimental results at Alipay demonstrate the effectiveness of our proposed approach.
Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives
Muehlebach, Michael, Jordan, Michael I.
We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis encompasses discrete time and continuous time, as well as time-invariant and time-variant formulations, and is not limited to a convex or Euclidean setting. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and provides a characterization of algorithms that exhibit accelerated convergence.
Distributionally Robust Chance Constrained Programming with Generative Adversarial Networks (GANs)
This paper presents a novel deep-learning-based data-driven optimization method. A novel generative adversarial network (GAN) based data-driven distributionally robust chance constrained programming framework is proposed. The GAN is applied to fully extract distributional information from historical data in a nonparametric and unsupervised way without a priori approximation or assumption. Since GANs utilize deep neural networks, complicated data distributions and modes can be learned, and uncertainty can be modeled efficiently and accurately. Distributionally robust chance constrained programming takes into consideration ambiguous probability distributions of uncertain parameters. To tackle the computational challenges, the sample average approximation method is adopted, and the required data samples are generated by the GAN in an end-to-end way through the differentiable networks. The proposed framework is then applied to supply chain optimization under demand uncertainty. The applicability of the proposed approach is illustrated through a county-level case study of a spatially explicit biofuel supply chain in Illinois.