Optimization
IPBoost -- Non-Convex Boosting via Integer Programming
Pfetsch, Marc E., Pokutta, Sebastian
Boosting is an important (and by now standard) technique in classification to combine several'low accuracy' learners, so-called base learners, into a'high accuracy' learner, a so-called boosted learner. Pioneered by the AdaBoost approach of [19], in recent decades there has been extensive work on boosting procedures and analyses of their limitations. In a nutshell, boosting procedures are (typically) iterative schemes that roughly work as follows: for t 1,..., T do the following:
Differentiating the Black-Box: Optimization with Local Generative Surrogates
Shirobokov, Sergey, Belavin, Vladislav, Kagan, Michael, Ustyuzhanin, Andrey, Baydin, Atılım Güneş
We propose a novel method for gradient-based optimization of black-box simulators using differentiable local surrogate models. In fields such as physics and engineering, many processes are modeled with non-differentiable simulators with intractable likelihoods. Optimization of these forward models is particularly challenging, especially when the simulator is stochastic. To address such cases, we introduce the use of deep generative models to iteratively approximate the simulator in local neighborhoods of the parameter space. We demonstrate that these local surrogates can be used to approximate the gradient of the simulator, and thus enable gradient-based optimization of simulator parameters. In cases where the dependence of the simulator on the parameter space is constrained to a low dimensional submanifold, we observe that our method attains minima faster than all baseline methods, including Bayesian optimization, numerical optimization, and REINFORCE driven approaches.
Mech-Elites: Illuminating the Mechanic Space of GVGAI
Charity, Megan, Green, Michael Cerny, Khalifa, Ahmed, Togelius, Julian
This paper introduces a fully automatic method of mechanic illumination for general video game level generation. Using the Constrained MAP-Elites algorithm and the GVG-AI framework, this system generates the simplest tile based levels that contain specific sets of game mechanics and also satisfy playability constraints. We apply this method to illuminate mechanic space for $4$ different games in GVG-AI: Zelda, Solarfox, Plants, and RealPortals.
Super-efficiency of automatic differentiation for functions defined as a minimum
Ablin, Pierre, Peyré, Gabriel, Moreau, Thomas
In min-min optimization or max-min optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closed-form, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a super-efficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to chose between them.
Submodular Maximization Through Barrier Functions
Badanidiyuru, Ashwinkumar, Karbasi, Amin, Kazemi, Ehsan, Vondrak, Jan
In the constrained continuous optimization, barrier functions are usually used to impose an increasingly large cost on a feasible point as it approaches the boundary of the feasible region [32]. In effect, barrier functions replace constraints by a penalizing term in the primal objective function so that the solution stays away from the boundary of the feasible region. This is an attempt to approximate a constrained optimization problem with an unconstrained one and to later apply standard optimization techniques. While the benefits of barrier functions are studied extensively in the continuous domain [32], their use in discrete optimization is not very well understood. In this paper, we show how discrete barrier functions manifest themselves in constrained submodular maximization. Submodular functions formalize the intuitive diminishing returns condition, a property that not only allows optimization tractability but also appears in many machine learning applications, including video, image, and text summarization [7, 12, 23, 28, 35], active set selection in nonparametric learning [26], sequential decision making [27, 29] sensor placement, information gathering [10], privacy and fairness [16].
Regularized Submodular Maximization at Scale
Kazemi, Ehsan, Minaee, Shervin, Feldman, Moran, Karbasi, Amin
In this paper, we propose scalable methods for maximizing a regularized submodular function $f = g - \ell$ expressed as the difference between a monotone submodular function $g$ and a modular function $\ell$. Indeed, submodularity is inherently related to the notions of diversity, coverage, and representativeness. In particular, finding the mode of many popular probabilistic models of diversity, such as determinantal point processes, submodular probabilistic models, and strongly log-concave distributions, involves maximization of (regularized) submodular functions. Since a regularized function $f$ can potentially take on negative values, the classic theory of submodular maximization, which heavily relies on the non-negativity assumption of submodular functions, may not be applicable. To circumvent this challenge, we develop the first one-pass streaming algorithm for maximizing a regularized submodular function subject to a $k$-cardinality constraint. It returns a solution $S$ with the guarantee that $f(S)\geq(\phi^{-2}-\epsilon) \cdot g(OPT)-\ell (OPT)$, where $\phi$ is the golden ratio. Furthermore, we develop the first distributed algorithm that returns a solution $S$ with the guarantee that $\mathbb{E}[f(S)] \geq (1-\epsilon) [(1-e^{-1}) \cdot g(OPT)-\ell(OPT)]$ in $O(1/ \epsilon)$ rounds of MapReduce computation, without keeping multiple copies of the entire dataset in each round (as it is usually done). We should highlight that our result, even for the unregularized case where the modular term $\ell$ is zero, improves the memory and communication complexity of the existing work by a factor of $O(1/ \epsilon)$ while arguably provides a simpler distributed algorithm and a unifying analysis. We also empirically study the performance of our scalable methods on a set of real-life applications, including finding the mode of distributions, data summarization, and product recommendation.
Learning High Order Feature Interactions with Fine Control Kernels
Paskov, Hristo, Paskov, Alex, West, Robert
We provide a methodology for learning sparse statistical models that use as features all possible multiplicative interactions among an underlying atomic set of features. While the resulting optimization problems are exponentially sized, our methodology leads to algorithms that can often solve these problems exactly or provide approximate solutions based on combining highly correlated features. We also introduce an algorithmic paradigm, the Fine Control Kernel framework, so named because it is based on Fenchel Duality and is reminiscent of kernel methods. Its theory is tailored to large sparse learning problems, and it leads to efficient feature screening rules for interactions. These rules are inspired by the Apriori algorithm for market basket analysis -- which also falls under the purview of Fine Control Kernels, and can be applied to a plurality of learning problems including the Lasso and sparse matrix estimation. Experiments on biomedical datasets demonstrate the efficacy of our methodology in deriving algorithms that efficiently produce interactions models which achieve state-of-the-art accuracy and are interpretable.
Bi-objective Optimization of Biclustering with Binary Data
Glover, Fred, Hanafi, Said, Palubeckis, Gintaras
Clustering consists of partitioning data objects into subsets called clusters according to some similarity criteria. This paper addresses a generalization called quasi-clustering that allows overlapping of clusters, and which we link to biclustering. Biclustering simultaneously groups the objects and features so that a specific group of objects has a special group of features. In recent years, biclustering has received a lot of attention in several practical applications. In this paper we consider a bi-objective optimization of biclustering problem with binary data. First we present an integer programing formulations for the bi-objective optimization biclustering. Next we propose a constructive heuristic based on the set intersection operation and its efficient implementation for solving a series of mono-objective problems used inside the Epsilon-constraint method (obtained by keeping only one objective function and the other objective function is integrated into constraints). Finally, our experimental results show that using CPLEX solver as an exact algorithm for finding an optimal solution drastically increases the computational cost for large instances, while our proposed heuristic provides very good results and significantly reduces the computational expense.
On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions
Arjevani, Yossi, Daniely, Amit, Jegelka, Stefanie, Lin, Hongzhou
Recent advances in randomized incremental methods for minimizing $L$-smooth $\mu$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/\mu})\log(1/\epsilon))$ and $O(n+\sqrt{nL/\epsilon})$, where $\mu>0$ and $\mu=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $\Omega(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible with stochastic oracles}, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/\mu})\log(1/\epsilon))$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tilde{\Omega}(n^2+\sqrt{nL/\mu}\log(1/\epsilon))$ and $\tilde{\Omega}(n^2+\sqrt{nL/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
Projective Preferential Bayesian Optimization
Mikkola, Petrus, Todorović, Milica, Järvi, Jari, Rinke, Patrick, Kaski, Samuel
Bayesian optimization is an effective method for finding extrema of a black-box function. We propose a new type of Bayesian optimization for learning user preferences in high-dimensional spaces. The central assumption is that the underlying objective function cannot be evaluated directly, but instead a minimizer along a projection can be queried, which we call a projective preferential query. The form of the query allows for feedback that is natural for a human to give, and which enables interaction. This is demonstrated in a user experiment in which the user feedback comes in the form of optimal position and orientation of a molecule adsorbing to a surface. We demonstrate that our framework is able to find a global minimum of a high-dimensional black-box function, which is an infeasible task for existing preferential Bayesian optimization frameworks that are based on pairwise comparisons.