Optimization
You May Not Need Ratio Clipping in PPO
Sun, Mingfei, Kurin, Vitaly, Liu, Guoqing, Devlin, Sam, Qin, Tao, Hofmann, Katja, Whiteson, Shimon
Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective, and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition to early stop the optimization epoch in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epoch, and call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and show that ESPO significantly outperforms PPO. Furthermore, we show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
Fuzzy Segmentations of a String
Kostanyan, Armen, Harmandayan, Arevik
This article discusses a particular case of the data clustering problem, where it is necessary to find groups of adjacent text segments of the appropriate length that match a fuzzy pattern represented as a sequence of fuzzy properties. To solve this problem, a heuristic algorithm for finding a sufficiently large number of solutions is proposed. The key idea of the proposed algorithm is the use of the prefix structure to track the process of mapping text segments to fuzzy properties. An important special case of the text segmentation problem is the fuzzy string matching problem, when adjacent text segments have unit length and, accordingly, the fuzzy pattern is a sequence of fuzzy properties of text characters. It is proven that the heuristic segmentation algorithm in this case finds all text segments that match the fuzzy pattern. Finally, we consider the problem of a best segmentation of the entire text based on a fuzzy pattern, which is solved using the dynamic programming method.
A framework for bilevel optimization that enables stochastic and global variance reduction algorithms
Dagréou, Mathieu, Ablin, Pierre, Vaiter, Samuel, Moreau, Thomas
Bilevel optimization, the problem of minimizing a value function which involves the arg-minimum of another function, appears in many areas of machine learning. In a large scale setting where the number of samples is huge, it is crucial to develop stochastic methods, which only use a few samples at a time to progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework, in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance reduction algorithms, where the dynamics of all variables is subject to variance reduction. We demonstrate that SABA, an adaptation of the celebrated SAGA algorithm in our framework, has $O(\frac1T)$ convergence rate, and that it achieves linear convergence under Polyak-Lojasciewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.
BEER: Fast $O(1/T)$ Rate for Decentralized Nonconvex Optimization with Communication Compression
Zhao, Haoyu, Li, Boyue, Li, Zhize, Richtárik, Peter, Chi, Yuejie
Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle the communication bottleneck, there have been many efforts to design communication-compressed algorithms for decentralized nonconvex optimization, where the clients are only allowed to communicate a small amount of quantized information (aka bits) with their neighbors over a predefined graph topology. Despite significant efforts, the state-of-the-art algorithm in the nonconvex setting still suffers from a slower rate of convergence $O((G/T)^{2/3})$ compared with their uncompressed counterpart, where $G$ measures the data heterogeneity across different clients, and $T$ is the number of communication rounds. This paper proposes BEER, which adopts communication compression with gradient tracking, and shows it converges at a faster rate of $O(1/T)$. This significantly improves over the state-of-the-art rate, by matching the rate without compression even under arbitrary data heterogeneity. Numerical experiments are also provided to corroborate our theory and confirm the practical superiority of BEER in the data heterogeneous regime.
Potential Destination Prediction Based on Knowledge Graph Under Low Predictability Data Condition
Li, Guilong, Chen, Yixian, Liao, Qionghua, He, Zhaocheng
Destination prediction has been a critical topic in transportation research, and there are a large number of studies. However, almost all existing studies are based on high predictability data conditions while pay less attention to the data condition with low predictability, where the regularity of single individuals is not exposed. Based on a certain period of observation, there is a fact that individuals may choose destinations beyond observation, which we call "potential destinations". The number of potential destinations is very large and can't be ignored for the data condition with low predictability formed by short-term observation.To reveal the choice pattern of potential destination of individuals under the data condition with low predictability, we propose a global optimization method based on knowledge graph embedding. First, we joint the trip data of all individuals by constructing Trip Knowledge Graph(TKG). Next, we optimize the general algorithm of knowledge graph embedding for our data and task in training strategy and objective function, then implement it on TKG. It can achieve global optimization for association paths that exist between almost any two entities in TKG. On this basis, a method for potential destination prediction is proposed, giving the possible ranking of unobserved destinations for each individual. In addition, we improve the performance by fusing static statistical information that is not passed to TKG. Finally, we validate our method in a real-world dataset, and the prediction results are highly consistent with individuals' potential destination choice behaviour.
On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces
Bedi, Amrit Singh, Chakraborty, Souradip, Parayil, Anjaly, Sadler, Brian, Tokekar, Pratap, Koppel, Alec
We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes, whereas prior works require these parameters to respectively shrink to null or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.
GenMod: A generative modeling approach for spectral representation of PDEs with random inputs
Wentz, Jacqueline, Doostan, Alireza
We propose a method for quantifying uncertainty in high-dimensional PDE systems with random parameters, where the number of solution evaluations is small. Parametric PDE solutions are often approximated using a spectral decomposition based on polynomial chaos expansions. For the class of systems we consider (i.e., high dimensional with limited solution evaluations) the coefficients are given by an underdetermined linear system in a regression formulation. This implies additional assumptions, such as sparsity of the coefficient vector, are needed to approximate the solution. Here, we present an approach where we assume the coefficients are close to the range of a generative model that maps from a low to a high dimensional space of coefficients. Our approach is inspired be recent work examining how generative models can be used for compressed sensing in systems with random Gaussian measurement matrices. Using results from PDE theory on coefficient decay rates, we construct an explicit generative model that predicts the polynomial chaos coefficient magnitudes. The algorithm we developed to find the coefficients, which we call GenMod, is composed of two main steps. First, we predict the coefficient signs using Orthogonal Matching Pursuit. Then, we assume the coefficients are within a sparse deviation from the range of a sign-adjusted generative model. This allows us to find the coefficients by solving a nonconvex optimization problem, over the input space of the generative model and the space of sparse vectors. We obtain theoretical recovery results for a Lipschitz continuous generative model and for a more specific generative model, based on coefficient decay rate bounds. We examine three high-dimensional problems and show that, for all three examples, the generative model approach outperforms sparsity promoting methods at small sample sizes.
Riemannian block SPD coupling manifold and its application to optimal transport
Han, Andi, Mishra, Bamdev, Jawanpuria, Pratik, Gao, Junbin
Optimal transport (OT) has seen its popularity in various fields of applications. We start by observing that the OT problem can be viewed as an instance of a general symmetric positive definite (SPD) matrix-valued OT problem, where the cost, the marginals, and the coupling are represented as block matrices and each component block is a SPD matrix. The summation of row blocks and column blocks in the coupling matrix are constrained by the given block-SPD marginals. We endow the set of such block-coupling matrices with a novel Riemannian manifold structure. This allows to exploit the versatile Riemannian optimization framework to solve generic SPD matrix-valued OT problems. We illustrate the usefulness of the proposed approach in several applications.
Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times
Calandriello, Daniele, Carratino, Luigi, Lazaric, Alessandro, Valko, Michal, Rosasco, Lorenzo
Computing a Gaussian process (GP) posterior has a computational cost cubical in the number of historical points. A reformulation of the same GP posterior highlights that this complexity mainly depends on how many \emph{unique} historical points are considered. This can have important implication in active learning settings, where the set of historical points is constructed sequentially by the learner. We show that sequential black-box optimization based on GPs (GP-Opt) can be made efficient by sticking to a candidate solution for multiple evaluation steps and switch only when necessary. Limiting the number of switches also limits the number of unique points in the history of the GP. Thus, the efficient GP reformulation can be used to exactly and cheaply compute the posteriors required to run the GP-Opt algorithms. This approach is especially useful in real-world applications of GP-Opt with high switch costs (e.g. switching chemicals in wet labs, data/model loading in hyperparameter optimization). As examples of this meta-approach, we modify two well-established GP-Opt algorithms, GP-UCB and GP-EI, to switch candidates as infrequently as possible adapting rules from batched GP-Opt. These versions preserve all the theoretical no-regret guarantees while improving practical aspects of the algorithms such as runtime, memory complexity, and the ability of batching candidates and evaluating them in parallel.
Continual Learning with Recursive Gradient Optimization
Learning multiple tasks sequentially without forgetting previous knowledge, called Continual Learning(CL), remains a long-standing challenge for neural networks. Most existing methods rely on additional network capacity or data replay. In contrast, we introduce a novel approach which we refer to as Recursive Gradient Optimization(RGO). RGO is composed of an iteratively updated optimizer that modifies the gradient to minimize forgetting without data replay and a virtual Feature Encoding Layer(FEL) that represents different long-term structures with only task descriptors. Experiments demonstrate that RGO has significantly better performance on popular continual classification benchmarks when compared to the baselines and achieves new state-of-the-art performance on 20-split-CIFAR100(82.22%) and 20-split-miniImageNet(72.63%). With higher average accuracy than Single-Task Learning(STL), this method is flexible and reliable to provide continual learning capabilities for learning models that rely on gradient descent.