Goto

Collaborating Authors

 Optimization


Structural Pruning via Latency-Saliency Knapsack

Neural Information Processing Systems

Structural pruning can simplify network architecture and improve inference speed. We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on targeting device. For filter importance ranking, HALP leverages latency lookup table to track latency reduction potential and global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning under a reward maximization problem given target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and accuracy-efficiency trade-off. We examine HALP on both classification and detection tasks, over varying networks, on ImageNet and VOC datasets, on different platforms.


Learning to Constrain Policy Optimization with Virtual Trust Region

Neural Information Processing Systems

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses two trust regions to regulate each policy update. In addition to using the proximity of one single old policy as the first trust region as done by prior works, we propose forming a second trust region by constructing another virtual policy that represents a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. We propose a mechanism to automatically build the virtual policy from a memory buffer of past policies, providing a new capability for dynamically selecting appropriate trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.


Efficient Projection-free Algorithms for Saddle Point Problems

Neural Information Processing Systems

The Frank-Wolfe algorithm is a classic method for constrained optimization problems. It has recently been popular in many machine learning applications because its projection-free property leads to more efficient iterations. In this paper, we study projection-free algorithms for convex-strongly-concave saddle point problems with complicated constraints. Our method combines Conditional Gradient Sliding with Mirror-Prox and show that it only requires $\tilde{\cO}(1/\sqrt{\epsilon})$ gradient evaluations and $\tilde{\cO}(1/\epsilon^2)$ linear optimizations in the batch setting. We also extend our method to the stochastic setting and propose first stochastic projection-free algorithms for saddle point problems. Experimental results demonstrate the effectiveness of our algorithms and verify our theoretical guarantees.


Reinforced Genetic Algorithm for Structure-based Drug Design

Neural Information Processing Systems

Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to a disease-related protein (targets), which is the primary approach to computer-aided drug discovery. Recently, applying deep generative models for three-dimensional (3D) molecular design conditioned on protein pockets to solve SBDD has attracted much attention, but their formulation as probabilistic modeling often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance in various molecular optimization tasks. However, they do not utilize protein target structure to inform design steps but rely on a random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve a more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA) that uses neural models to prioritize the profitable design steps and suppress random-walk behavior. The neural models take the 3D structure of the targets and ligands as inputs and are pre-trained using native complex structures to utilize the knowledge of the shared binding physics from different targets and then fine-tuned during optimization. We conduct thorough empirical studies on optimizing binding affinity to various disease targets and show that RGA outperforms the baselines in terms of docking scores and is more robust to random initializations. The ablation study also indicates that the training on different targets helps improve the performance by leveraging the shared underlying physics of the binding processes.


Faster proximal algorithms for matrix optimization using Jacobi-based eigenvalue methods

Neural Information Processing Systems

We consider proximal splitting algorithms for convex optimization problems over matrices. A significant computational bottleneck in many of these algorithms is the need to compute a full eigenvalue or singular value decomposition at each iteration for the evaluation of a proximal operator.In this paper we propose to use an old and surprisingly simple method due to Jacobi to compute these eigenvalue and singular value decompositions, and we demonstrate that it can lead to substantial gains in terms of computation time compared to standard approaches. We rely on three essential properties of this method: (a) its ability to exploit an approximate decomposition as an initial point, which in the case of iterative optimization algorithms can be obtained from the previous iterate; (b) its parallel nature which makes it a great fit for hardware accelerators such as GPUs, now common in machine learning, and (c) its simple termination criterion which allows us to trade-off accuracy with computation time. We demonstrate the efficacy of this approach on a variety of algorithms and problems, and show that, on a GPU, we can obtain 5 to 10x speed-ups in the evaluation of proximal operators compared to standard CPU or GPU linear algebra routines. Our findings are supported by new theoretical results providing guarantees on the approximation quality of proximal operators obtained using approximate eigenvalue or singular value decompositions.


Multi-task Learning of Order-Consistent Causal Graphs

Neural Information Processing Systems

We consider the problem of discovering $K$ related Gaussian directed acyclic graphs (DAGs), where the involved graph structures share a consistent causal order and sparse unions of supports. Under the multi-task learning setting, we propose a $l_1/l_2$-regularized maximum likelihood estimator (MLE) for learning $K$ linear structural equation models. We theoretically show that the joint estimator, by leveraging data across related tasks, can achieve a better sample complexity for recovering the causal order (or topological order) than separate estimations. Moreover, the joint estimator is able to recover non-identifiable DAGs, by estimating them together with some identifiable DAGs. Lastly, our analysis also shows the consistency of union support recovery of the structures. To allow practical implementation, we design a continuous optimization problem whose optimizer is the same as the joint estimator and can be approximated efficiently by an iterative algorithm.


Empirical Likelihood for Contextual Bandits

Neural Information Processing Systems

We propose an estimator and confidence interval for computing the value of a policy from off-policy data in the contextual bandit setting. To this end we apply empirical likelihood techniques to formulate our estimator and confidence interval as simple convex optimization problems. Using the lower bound of our confidence interval, we then propose an off-policy policy optimization algorithm that searches for policies with large reward lower bound. We empirically find that both our estimator and confidence interval improve over previous proposals in finite sample regimes. Finally, the policy optimization algorithm we propose outperforms a strong baseline system for learning from off-policy data.


Necessary and Sufficient Conditions for Optimal Decision Trees using Dynamic Programming

Neural Information Processing Systems

Global optimization of decision trees has shown to be promising in terms of accuracy, size, and consequently human comprehensibility. However, many of the methods used rely on general-purpose solvers for which scalability remains an issue.Dynamic programming methods have been shown to scale much better because they exploit the tree structure by solving subtrees as independent subproblems. However, this only works when an objective can be optimized separately for subtrees.We explore this relationship in detail and show the necessary and sufficient conditions for such separability and generalize previous dynamic programming approaches into a framework that can optimize any combination of separable objectives and constraints.Experiments on five application domains show the general applicability of this framework, while outperforming the scalability of general-purpose solvers by a large margin.


On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond

Neural Information Processing Systems

The \FedProx~algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success witnessed in practice, the theoretical understanding of FedProx is largely underinvestigated: the appealing convergence behavior of \FedProx~is so far characterized under certain non-standard and unrealistic dissimilarity assumptions of local functions, and the results are limited to smooth optimization problems. In order to remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for \FedProx~and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we contribute to derive several new and deeper insights into \FedProx~for non-convex federated optimization including: 1) convergence guarantees invariant to certain stringent local dissimilarity conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to size of minibatch and number of sampled devices. Our theory for the first time reveals that local dissimilarity and smoothness are not must-have for \FedProx~to get favorable complexity bounds.


Low-Fidelity Video Encoder Optimization for Temporal Action Localization

Neural Information Processing Systems

Most existing temporal action localization (TAL) methods rely on a transfer learning pipeline: by first optimizing a video encoder on a large action classification dataset (i.e., source domain), followed by freezing the encoder and training a TAL head on the action localization dataset (i.e., target domain). This results in a task discrepancy problem for the video encoder - trained for action classification, but used for TAL. Intuitively, joint optimization with both the video encoder and TAL head is a strong baseline solution to this discrepancy. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity (LoFi) video encoder optimization method.