 Optimization


Uncertainty Principle based optimization; new metaheuristics framework

arXiv.org Artificial Intelligence

To balance exploration and exploitation more flexibly, this paper proposes a new metaheuristic method based on Uncertainty Principle (UP) concepts. UP has proved effective in multiple branches of science. In quantum mechanics, canonically conjugate observables such as position and momentum cannot both be precisely determined in any quantum state. Similarly, in spectral filter design, a nonzero function and its Fourier transform cannot both be sharply localized. After reviewing these concepts of the Uncertainty Principle and their variations in quantum physics, Fourier analysis, and wavelet design, we describe the proposed framework in terms of an algorithm and a flowchart. The optimizer's idea is based on an inherent uncertainty in performing local search versus global search. A set of compatible metrics for each part of the framework is proposed to derive the preferred form of the algorithm. Evaluations and comparisons at the end of the paper show the competency and distinct capability of the algorithm compared with some well-known and recently proposed metaheuristics.


Maximizing Cumulative User Engagement in Sequential Recommendation: An Online Optimization Perspective

arXiv.org Artificial Intelligence

To maximize cumulative user engagement (e.g., cumulative clicks) in sequential recommendation, it is often necessary to trade off two potentially conflicting objectives: pursuing higher immediate user engagement (e.g., click-through rate) and encouraging user browsing (i.e., more items exposed). Existing works often study these two tasks separately and thus tend to produce sub-optimal results. In this paper, we study this problem from an online optimization perspective and propose a flexible and practical framework to explicitly trade off longer user browsing length against high immediate user engagement. Specifically, by considering items as actions, user requests as states, and user leaving as an absorbing state, we formulate each user's behavior as a personalized Markov decision process (MDP), and the problem of maximizing cumulative user engagement is reduced to a stochastic shortest path (SSP) problem. Meanwhile, with immediate user engagement and quit probability estimation, it is shown that the SSP problem can be efficiently solved via dynamic programming. Experiments on real-world datasets demonstrate the effectiveness of the proposed approach. Moreover, this approach has been deployed at a large e-commerce platform, achieving an improvement of over 7% in cumulative clicks.
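The trade-off between immediate clicks and continued browsing can be illustrated with a small backward-induction sketch. This is not the paper's formulation: it assumes a fixed number of recommendation steps, a hypothetical per-item pair `(click_prob, quit_prob)` that does not depend on the user state, and that an item may be shown more than once. The function name `plan_browsing_session` is likewise made up for illustration.

```python
def plan_browsing_session(items, horizon):
    """Backward induction over a simplified browsing model (illustrative only).

    items   -- list of (click_prob, quit_prob) pairs, one per candidate item
               (hypothetical estimates; the abstract does not specify them)
    horizon -- number of recommendation steps to plan
    Returns (policy, expected_clicks), where policy[t] is the item index
    to show at step t.
    """
    value_next = 0.0          # expected future clicks after the last step
    policy = []
    for _ in range(horizon):  # iterate from the last step back to the first
        best_item, best_value = max(
            (
                # immediate click + value of continuing if the user stays
                (i, p + (1.0 - q) * value_next)
                for i, (p, q) in enumerate(items)
            ),
            key=lambda pair: pair[1],
        )
        policy.append(best_item)
        value_next = best_value
    policy.reverse()          # decisions were collected last-to-first
    return policy, value_next


# Item 0: attractive but often ends the session; item 1: weaker but "sticky".
policy, expected_clicks = plan_browsing_session([(0.5, 0.9), (0.3, 0.1)], horizon=3)
print(policy, expected_clicks)
```

In this toy setting the plan shows the low-quit item early and saves the high-click, high-quit item for the final step, which is the kind of behavior a purely greedy click-through ranking would miss.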


Acceleration of Descent-based Optimization Algorithms via Carathéodory's Theorem

arXiv.org Machine Learning

We propose a new technique to accelerate algorithms based on Gradient Descent using Carathéodory's Theorem. In the case of the standard Gradient Descent algorithm, we analyse the theoretical convergence of the approach under convexity assumptions and empirically demonstrate its improvements. As a core contribution, we then present an application of the acceleration technique to Block Coordinate Descent methods. Experimental comparisons on least squares regression with a LASSO regularisation term show remarkably better performance than the ADAM and SAG algorithms.


A modification of quasi-Newton's methods helping to avoid saddle points

arXiv.org Machine Learning

We recall that if $A$ is an invertible and symmetric real $m\times m$ matrix, then it is diagonalisable. Therefore, if we denote by $\mathcal{E}^{+}(A)\subset \mathbb{R}^m$ (respectively $\mathcal{E}^{-}(A)\subset \mathbb{R}^m$) the vector subspace generated by the eigenvectors of $A$ with positive (respectively negative) eigenvalues, then we have an orthogonal decomposition $\mathbb{R}^m=\mathcal{E}^{+}(A)\oplus \mathcal{E}^{-}(A)$. Hence, every $x\in \mathbb{R}^m$ can be written uniquely as $x=pr_{A,+}(x)+pr_{A,-}(x)$ with $pr_{A,+}(x)\in \mathcal{E}^{+}(A)$ and $pr_{A,-}(x)\in \mathcal{E}^{-}(A)$. We propose the following simple new modification of quasi-Newton's methods. New Q-Newton's method: Let $\Delta =\{\delta _0,\delta _1,\delta _2,\ldots \}$ be a countable set of real numbers which has at least $m+1$ elements. Let $f:\mathbb{R}^m\rightarrow \mathbb{R}$ be a $C^2$ function. Let $\alpha >0$. For each $x\in \mathbb{R}^m$ such that $\nabla f(x)\neq 0$, let $\delta (x)=\delta _j$, where $j$ is the smallest index such that $\nabla ^2f(x)+\delta _j||\nabla f(x)||^{1+\alpha}Id$ is invertible. (If $\nabla f(x)=0$, then we choose $\delta (x)=\delta _0$.) Let $x_0\in \mathbb{R}^m$ be an initial point. We define a sequence of points $x_n\in \mathbb{R}^m$ and invertible and symmetric $m\times m$ matrices $A_n$ as follows: $A_n=\nabla ^2f(x_n)+\delta (x_n) ||\nabla f(x_n)||^{1+\alpha}Id$ and $x_{n+1}=x_n-w_n$, where $w_n=pr_{A_n,+}(v_n)-pr_{A_n,-}(v_n)$ and $v_n=A_n^{-1}\nabla f(x_n)$. The main result of this paper roughly says that if $f$ is $C^3$ and a sequence $\{x_n\}$, constructed by the New Q-Newton's method from a random initial point $x_0$, converges, then the limit point is not a saddle point, and the convergence rate is the same as that of Newton's method.
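A single update of the method described above can be sketched in NumPy as follows. This is an illustrative reading of the abstract, not the authors' code: the function name `new_q_newton_step`, the default values in `deltas`, and the numerical tolerance `tol` used to decide invertibility are assumptions. The sketch uses the fact that, in the eigenbasis of $A_n$, the vector $w_n=pr_{A_n,+}(v_n)-pr_{A_n,-}(v_n)$ is obtained by dividing each component of $\nabla f(x_n)$ by the absolute value of the corresponding eigenvalue.

```python
import numpy as np


def new_q_newton_step(grad, hess, x, deltas=(0.0, 1.0, 2.0), alpha=1.0, tol=1e-10):
    """One step x_{n+1} = x_n - w_n (sketch; names and defaults are assumptions).

    grad, hess -- callables returning the gradient (m,) and Hessian (m, m) at x
    deltas     -- candidate shifts delta_0, delta_1, ...; needs at least m + 1 entries
    alpha      -- exponent parameter, alpha > 0
    tol        -- numerical threshold below which an eigenvalue counts as zero
    """
    x = np.asarray(x, dtype=float)
    m = x.size
    if len(deltas) < m + 1:
        raise ValueError("deltas must contain at least m + 1 values")

    g = grad(x)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return x.copy()  # zero gradient gives w_n = 0, so the point does not move

    H = hess(x)
    for delta in deltas:
        # A = Hessian + delta * ||grad||^(1 + alpha) * Id, symmetric by construction
        A = H + delta * g_norm ** (1.0 + alpha) * np.eye(m)
        eigvals, eigvecs = np.linalg.eigh(A)
        if np.min(np.abs(eigvals)) > tol:  # first shift that makes A invertible
            break
    else:
        raise RuntimeError("no candidate delta produced an invertible matrix")

    # v = A^{-1} g has eigen-coefficients c_i / lambda_i; flipping the sign on the
    # negative eigenspace gives w with coefficients c_i / |lambda_i|.
    coeffs = eigvecs.T @ g
    w = eigvecs @ (coeffs / np.abs(eigvals))
    return x - w


# Example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad_f = lambda p: np.array([2.0 * p[0], -2.0 * p[1]])
hess_f = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])
print(new_q_newton_step(grad_f, hess_f, np.array([1.0, 0.5])))
```

On this quadratic saddle, a plain Newton step would jump straight to the origin, whereas the sign flip on the negative eigenspace moves the iterate away from the saddle along the descending direction.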


Recht-Ré Noncommutative Arithmetic-Geometric Mean Conjecture is False

arXiv.org Machine Learning

Stochastic optimization algorithms have become indispensable in modern machine learning. An unresolved foundational question in this area is the difference between with-replacement sampling and without-replacement sampling: does the latter have a superior convergence rate compared to the former? A groundbreaking result of Recht and Ré reduces the problem to a noncommutative analogue of the arithmetic-geometric mean inequality where $n$ positive numbers are replaced by $n$ positive definite matrices. If this inequality holds for all $n$, then without-replacement sampling indeed outperforms with-replacement sampling. The conjectured Recht-Ré inequality has so far only been established for $n = 2$ and a special case of $n = 3$. We will show that the Recht-Ré conjecture is false for general $n$. Our approach relies on the noncommutative Positivstellensatz, which allows us to reduce the conjectured inequality to a semidefinite program and the validity of the conjecture to certain bounds for the optimum values, which we show are false as soon as $n = 5$.


ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

arXiv.org Machine Learning

We introduce AdaHessian, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and ADAM. The main disadvantage of traditional second order methods is their heavier per-iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in AdaHessian, including: (i) a new variance reduction estimate of the Hessian diagonal with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that AdaHessian achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of ADAM. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that AdaHessian: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to ADAM; (ii) outperforms ADAMW for transformers by 0.27/0.33 BLEU score on IWSLT14/WMT14 and 1.8/1.0 PPL on PTB/Wikitext-103; and (iii) achieves 0.032% better score than AdaGrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of AdaHessian is comparable to first-order methods, and that it exhibits robustness towards its hyperparameters. The code for AdaHessian is open-sourced and publicly available.
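The abstract does not say how the Hessian diagonal is estimated. One standard randomized technique for this kind of quantity is Hutchinson's estimator, which averages $z \odot (Hz)$ over random sign vectors $z$ and needs only Hessian-vector products. The sketch below illustrates that general technique on an explicit matrix; it is an assumption that this resembles the paper's estimator, and the names `hutchinson_diag` and `hvp` are hypothetical.

```python
import numpy as np


def hutchinson_diag(hvp, dim, num_samples=100, seed=0):
    """Estimate diag(H) from Hessian-vector products (illustrative sketch).

    hvp         -- callable z -> H @ z; the Hessian itself is never formed here
    dim         -- dimension of the parameter vector
    num_samples -- number of random probe vectors to average over
    """
    rng = np.random.default_rng(seed)
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        # E[z * (H z)] = diag(H) because E[z_i z_j] = 0 for i != j
        estimate += z * hvp(z)
    return estimate / num_samples


# Toy check on an explicit symmetric matrix standing in for a Hessian.
H = np.array([[2.0, 0.5], [0.5, 3.0]])
print(hutchinson_diag(lambda z: H @ z, dim=2, num_samples=2000))
```

In a deep learning setting `hvp` would typically come from automatic differentiation rather than an explicit matrix, which is what keeps the per-iteration cost close to that of a first-order method.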


Dynamic Bidding Strategies with Multivariate Feedback Control for Multiple Goals in Display Advertising

arXiv.org Machine Learning

Real-Time Bidding (RTB) display advertising is a method for purchasing display advertising inventory in auctions that occur within milliseconds. The performance of RTB campaigns is generally measured with a series of Key Performance Indicators (KPIs) - measurements used to ensure that the campaign is cost-effective and that it is purchasing valuable inventory. While an RTB campaign should ideally meet all KPIs, simultaneous improvement tends to be very challenging, as an improvement to any one KPI risks a detrimental effect on the others. Here we present an approach to simultaneously controlling multiple KPIs with a PID-based feedback-control system. This method generates a control score for each KPI, based on both the output of a PID controller module and a metric that quantifies the importance of each KPI for internal business needs. At regular intervals, this algorithm - Sequential Control - will choose the KPI with the greatest overall need for improvement. In this way, our algorithm is able to continually seek the greatest marginal improvements to its current state. Multiple methods of control can be associated with each KPI, and can be triggered either simultaneously or chosen stochastically, in order to avoid local optima. In both offline ad bidding simulations and testing on live traffic, our methods proved to be effective in simultaneously controlling multiple KPIs and bringing them toward their respective goals.
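A minimal sketch of the control-score idea follows, assuming a textbook discrete PID controller and a score defined as the importance weight times the magnitude of the PID output. The abstract does not give the actual scoring formula, so `PIDController`, `pick_kpi`, the KPI names, and all gains and weights here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (gains are illustrative)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, error, dt=1.0):
        """Return the control output for the current error (goal minus measurement)."""
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def pick_kpi(errors, controllers, importance):
    """Choose the KPI with the largest score (assumed: importance * |PID output|)."""
    scores = {
        name: importance[name] * abs(controllers[name].update(err))
        for name, err in errors.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical KPIs with their current errors and business importance weights.
controllers = {"cpc": PIDController(1.0, 0.0, 0.0), "ctr": PIDController(1.0, 0.0, 0.0)}
chosen, scores = pick_kpi({"cpc": 0.2, "ctr": 0.5}, controllers, {"cpc": 3.0, "ctr": 1.0})
print(chosen, scores)
```

Here the KPI with the smaller raw error is still selected because its importance weight is higher, which mirrors the abstract's point that the score combines controller output with business priority.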


Robust Reinforcement Learning with Wasserstein Constraint

arXiv.org Machine Learning

Robust Reinforcement Learning aims to find the optimal policy with some degree of robustness to environmental dynamics. Existing learning algorithms usually achieve robustness by disturbing the current state or simulating environmental parameters in a heuristic way, which lacks quantified robustness to the system dynamics (i.e., transition probability). To overcome this issue, we leverage Wasserstein distance to measure the disturbance to the reference transition kernel. With Wasserstein distance, we are able to connect transition kernel disturbance to the state disturbance, i.e., reduce an infinite-dimensional optimization problem to a finite-dimensional risk-aware problem. Through the derived risk-aware optimal Bellman equation, we show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm, the Wasserstein Robust Advantage Actor-Critic algorithm (WRAAC). The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.


Bayesian Optimisation vs. Input Uncertainty Reduction

arXiv.org Machine Learning

Simulators often require calibration inputs estimated from real world data, and the quality of the estimate can significantly affect simulation output. Particularly when performing simulation optimisation to find an optimal solution, the uncertainty in the inputs significantly affects the quality of the found solution. One remedy is to search for the solution that has the best performance on average over the uncertain range of inputs, yielding an optimal compromise solution. We consider the more general setting where a user may choose between either running simulations or instead collecting real world data. A user may choose an input and a solution and observe the simulation output, or instead query an external data source, which improves the input estimate and enables the search for a more focused, less compromised solution. We explicitly examine the trade-off between simulation and real data collection in order to find the optimal solution of the simulator with the true inputs. Using a value of information procedure, we propose a novel unified simulation optimisation procedure called Bayesian Information Collection and Optimisation (BICO) that, in each iteration, automatically determines which of the two actions (running simulations or data collection) is more beneficial. Numerical experiments demonstrate that the proposed algorithm is able to automatically determine an appropriate balance between optimisation and data collection.


Global Convergence of MAML for LQR

arXiv.org Machine Learning

The paper studies the performance of the Model-Agnostic Meta-Learning (MAML) algorithm as an optimization method. The goal is to determine the global convergence of MAML on sequential decision-making tasks possessing a common structure. We prove that the benign landscape of a single task leads to the global convergence of MAML in the single-task scenario and in the scenario of multiple structurally connected tasks. We also show that there is a two-task scenario that does not possess this global convergence property even for identical tasks. We analyze the landscape of the MAML objective on LQR tasks to determine what type of similarities in their structures enables the algorithm to converge to the globally optimal solution.