Goto

Collaborating Authors

 Optimization


Reinforced Few-Shot Acquisition Function Learning for Bayesian Optimization

arXiv.org Machine Learning

Bayesian optimization (BO) conventionally relies on handcrafted acquisition functions (AFs) to sequentially determine the sample points. However, it has been widely observed in practice that the best-performing AF in terms of regret can vary significantly under different types of black-box functions. It has remained a challenge to design one AF that can attain the best performance over a wide variety of black-box functions. This paper aims to attack this challenge through the perspective of reinforced few-shot AF learning (FSAF). Specifically, we first connect the notion of AFs with Q-functions and view a deep Q-network (DQN) as a surrogate differentiable AF. While it serves as a natural idea to combine DQN and an existing few-shot learning method, we identify that such a direct combination does not perform well due to severe overfitting, which is particularly critical in BO due to the need of a versatile sampling policy. To address this, we present a Bayesian variant of DQN with the following three features: (i) It learns a distribution of Q-networks as AFs based on the Kullback-Leibler regularization framework. This inherently provides the uncertainty required in sampling for BO and mitigates overfitting. (ii) For the prior of the Bayesian DQN, we propose to use a demo policy induced by an off-the-shelf AF for better training stability. (iii) On the meta-level, we leverage the meta-loss of Bayesian model-agnostic meta-learning, which serves as a natural companion to the proposed FSAF. Moreover, with the proper design of the Q-networks, FSAF is general-purpose in that it is agnostic to the dimension and the cardinality of the input domain. Through extensive experiments, we demonstrate that the FSAF achieves comparable or better regrets than the state-of-the-art benchmarks on a wide variety of synthetic and real-world test functions.


Unbalanced Optimal Transport through Non-negative Penalized Linear Regression

arXiv.org Machine Learning

This paper addresses the problem of Unbalanced Optimal Transport (UOT) in which the marginal conditions are relaxed (using weighted penalties in lieu of equality) and no additional regularization is enforced on the OT plan. In this context, we show that the corresponding optimization problem can be reformulated as a non-negative penalized linear regression problem. This reformulation allows us to propose novel algorithms inspired from inverse problems and nonnegative matrix factorization. In particular, we consider majorization-minimization which leads in our setting to efficient multiplicative updates for a variety of penalties. Furthermore, we derive for the first time an efficient algorithm to compute the regularization path of UOT with quadratic penalties. The proposed algorithm provides a continuity of piece-wise linear OT plans converging to the solution of balanced OT (corresponding to infinite penalty weights). We perform several numerical experiments on simulated and real data illustrating the new algorithms, and provide a detailed discussion about more sophisticated optimization tools that can further be used to solve OT problems thanks to our reformulation.


An Online Riemannian PCA for Stochastic Canonical Correlation Analysis

arXiv.org Machine Learning

We present an efficient stochastic algorithm (RSG+) for canonical correlation analysis (CCA) using a reparametrization of the projection matrices. We show how this reparametrization (into structured matrices), simple in hindsight, directly presents an opportunity to repurpose/adjust mature techniques for numerical optimization on Riemannian manifolds. Our developments nicely complement existing methods for this problem which either require $O(d^3)$ time complexity per iteration with $O(\frac{1}{\sqrt{t}})$ convergence rate (where $d$ is the dimensionality) or only extract the top $1$ component with $O(\frac{1}{t})$ convergence rate. In contrast, our algorithm offers a strict improvement for this classical problem: it achieves $O(d^2k)$ runtime complexity per iteration for extracting the top $k$ canonical components with $O(\frac{1}{t})$ convergence rate. While the paper primarily focuses on the formulation and technical analysis of its properties, our experiments show that the empirical behavior on common datasets is quite promising. We also explore a potential application in training fair models where the label of protected attribute is missing or otherwise unavailable.


Meta Learning for Knowledge Distillation

arXiv.org Artificial Intelligence

We present Meta Learning for Knowledge Distillation (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of different student capacity and hyperparameters, facilitating the use of KD on different tasks and models. With the prevalence of large neural networks with millions or billions of parameters, model compression is gaining prominence for facilitating efficient, eco-friendly deployment for machine learning applications. Previous works often train a large model as the "teacher"; then they fix the teacher and train a "student" model to mimic the behavior of the teacher, in order to transfer the knowledge from the teacher to the student. However, this paradigm has the following drawbacks: (1) The teacher is unaware of the student. Recent studies in pedagogy suggest student-centered learning, which considers students' characteristics and learning capability, has shown effectiveness improving students' performance (Cornelius-White, 2007; Wright, 2011).


Supervised Machine Learning with Plausible Deniability

arXiv.org Artificial Intelligence

We study the question of how well machine learning (ML) models trained on a certain data set provide privacy for the training data, or equivalently, whether it is possible to reverse-engineer the training data from a given ML model. While this is easy to answer negatively in the most general case, it is interesting to note that the protection extends over non-recoverability towards plausible deniability: Given an ML model $f$, we show that one can take a set of purely random training data, and from this define a suitable ``learning rule'' that will produce a ML model that is exactly $f$. Thus, any speculation about which data has been used to train $f$ is deniable upon the claim that any other data could have led to the same results. We corroborate our theoretical finding with practical examples, and open source implementations of how to find the learning rules for a chosen set of raining data.


Provably Faster Algorithms for Bilevel Optimization

arXiv.org Machine Learning

Bilevel optimization has been widely applied in many important machine learning applications such as hyperparameter optimization and meta-learning. Recently, several momentum-based algorithms have been proposed to solve bilevel optimization problems faster. However, those momentum-based algorithms do not achieve provably better computational complexity than $\mathcal{O}(\epsilon^{-2})$ of the SGD-based algorithm. In this paper, we propose two new algorithms for bilevel optimization, where the first algorithm adopts momentum-based recursive iterations, and the second algorithm adopts recursive gradient estimations in nested loops to decrease the variance. We show that both algorithms achieve the complexity of $\mathcal{O}(\epsilon^{-1.5})$, which outperforms all existing algorithms by the order of magnitude. Our experiments validate our theoretical results and demonstrate the superior empirical performance of our algorithms in hyperparameter applications. Our codes for MRBO, VRBO and other benchmarks are available $\text{online}^1$.


Bayesian Optimization over Hybrid Spaces

arXiv.org Machine Learning

We consider the problem of optimizing hybrid structures (mixture of discrete and continuous input variables) via expensive black-box function evaluations. This problem arises in many real-world applications. For example, in materials design optimization via lab experiments, discrete and continuous variables correspond to the presence/absence of primitive elements and their relative concentrations respectively. The key challenge is to accurately model the complex interactions between discrete and continuous variables. In this paper, we propose a novel approach referred as Hybrid Bayesian Optimization (HyBO) by utilizing diffusion kernels, which are naturally defined over continuous and discrete variables. We develop a principled approach for constructing diffusion kernels over hybrid spaces by utilizing the additive kernel formulation, which allows additive interactions of all orders in a tractable manner. We theoretically analyze the modeling strength of additive hybrid kernels and prove that it has the universal approximation property. Our experiments on synthetic and six diverse real-world benchmarks show that HyBO significantly outperforms the state-of-the-art methods.


Automatically Differentiable Random Coefficient Logistic Demand Estimation

arXiv.org Machine Learning

The random coefficient logistic demand model of Berry et al. (1995) (henceforth BLP) has been a workhorse of the New Empirical Industrial Organization literature, allowing for varied substitution patterns across products, and accounting for endogeneity of price. The reliability of its estimation has been the subject of rigorous debate (Nevo, 2000; Conlon and Gortmaker, 2020; Knittel and Metaxoglou, 2014), and the estimator itself has been the study of many proposed advances in econometric techniques as a sophisticated yet widely used structural model (Hong et al., 2020; Forneron and Ng, 2020). The most common implementation of the BLP estimator involves the use of a nested fixed point (NFP) as an inner loop within an outer loop of GMM estimation, although we acknowledge the Mathematical Programming with Equilibrium Constraints (MPEC) approach of Dubé et al. (2012), which is beyond the scope of this paper. Dubé et al. (2012) and Conlon and Gortmaker (2020) find that derivative-free optimization algorithms such as the Nelder-Meade or simplex algorithms often fail to converge or converge to the wrong solution. As such, the literature has settled on the use of analytical derivatives with a derivative-based optimization algorithm such as L-BFGS. Nevo (2000) provides the analytical derivative for demand-only (DO) BLP in detail, and Conlon and Gortmaker (2020) indicate that the same is possible for demand-and-supply (DS) BLP, although it involves tensor products.


Linear Programming in Data Science: College/University Level

#artificialintelligence

How to become a pro in Linear Programming for Data Science? In this course, you will learn all about the mathematical optimization of linear programming in data science. This course is very unique and has its own importance in its respective disciplines. Data science and business study heavily rely on optimization. Optimization is the study of analysis and interpreting mathematical data under special rules and formulas.


On-Line Policy Iteration for Infinite Horizon Dynamic Programming

#artificialintelligence

In this paper we propose an on-line policy iteration (PI) algorithm for finite-state infinite horizon discounted dynamic programming, whereby the policy improvement operation is done on-line, only for the states that are encountered during operation of the system. This allows the continuous updating/improvement of the current policy, thus resulting in a form of on-line PI that incorporates the improved controls into the current policy as new states and controls are generated. The algorithm converges in a finite number of stages to a type of locally optimal policy, and suggests the possibility of variants of PI and multiagent PI where the policy improvement is simplified. Moreover, the algorithm can be used with on-line replanning, and is also well-suited for on-line PI algorithms with value and policy approximations.