Optimization
Minimum Regret Search for Single- and Multi-Task Optimization
We propose minimum regret search (MRS), a novel acquisition function for Bayesian optimization. MRS bears similarities with information-theoretic approaches such as entropy search (ES). However, while ES aims in each query at maximizing the information gain with respect to the global maximum, MRS aims at minimizing the expected simple regret of its ultimate recommendation for the optimum. While empirically ES and MRS perform similar in most of the cases, MRS produces fewer outliers with high simple regret than ES. We provide empirical results both for a synthetic single-task optimization problem as well as for a simulated multi-task robotic control problem.
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kronecker Factors for Convolution (KFC), a tractable approximation to the Fisher matrix for convolutional networks based on a structured probabilistic model for the distribution over backpropagated derivatives. Similarly to the recently proposed Kronecker-Factored Approximate Curvature (K-FAC), each block of the approximate Fisher matrix decomposes as the Kronecker product of small matrices, allowing for efficient inversion. KFC captures important curvature information while still yielding comparably efficient updates to stochastic gradient descent (SGD). We show that the updates are invariant to commonly used reparameterizations, such as centering of the activations. In our experiments, approximate natural gradient descent with KFC was able to train convolutional networks several times faster than carefully tuned SGD. Furthermore, it was able to train the networks in 10-20 times fewer iterations than SGD, suggesting its potential applicability in a distributed setting.
Smart broadcasting: Do you want to be seen?
Karimi, Mohammad Reza, Tavakoli, Erfan, Farajtabar, Mehrdad, Song, Le, Gomez-Rodriguez, Manuel
Many users in online social networks are constantly trying to gain attention from their followers by broadcasting posts to them. These broadcasters are likely to gain greater attention if their posts can remain visible for a longer period of time among their followers' most recent feeds. Then when to post? In this paper, we study the problem of smart broadcasting using the framework of temporal point processes, where we model users feeds and posts as discrete events occurring in continuous time. Based on such continuous-time model, then choosing a broadcasting strategy for a user becomes a problem of designing the conditional intensity of her posting events. We derive a novel formula which links this conditional intensity with the visibility of the user in her followers' feeds. Furthermore, by exploiting this formula, we develop an efficient convex optimization framework for the when-to-post problem. Our method can find broadcasting strategies that reach a desired visibility level with provable guarantees. We experimented with data gathered from Twitter, and show that our framework can consistently make broadcasters' post more visible than alternatives.
Online Optimization of Smoothed Piecewise Constant Functions
Cohen-Addad, Vincent, Kanade, Varun
In this paper, we study the problem of online optimization of piecewise constant functions. This is motivated by the question of selecting optimal parameters for learning algorithms. Recently, Gupta and Roughgarden (2016) introduced a probably approximately correct (PAC) framework for choosing parameters of algorithms. Imagine a situation, when a website wishes to provide personalized results to a user. To respond to a user's query, the service provider may need to implement a learning (or some other type of) algorithm which involves choosing parameters. The choice of parameters affects the quality of solution and ideally we would like to design a mechanism where the service provider learns from past instances, or at least employs a strategy that has low regret with respect to the single optimal solution in hindsight. In many learning problems, the goal is to find parameters by optimizing a continuous function (of the parameters); however, ever so often one encounters problems with discrete solutions, such as k-means or independent set, which result in objective functions that have discontinuities. Concretely, we consider the problem of online optimization of piecewise constant functions over the domain [0, 1).
Machine learning outperforms physicists in experiment
Australian physicists have used an online optimization process based on machine learning to produce effective Bose-Einstein condensates (BECs) in a fraction of the time it would normally take the researchers. A BEC is a state of matter of a dilute gas of atoms trapped in a laser beam and cooled to temperatures just above absolute zero. BECs are extremely sensitive to external disturbances, which makes them ideal for research into quantum phenomena or for making very precise measurements such as tiny changes in the Earth's magnetic field or gravity. The experiment, developed by physicists from ANU, University of Adelaide and UNSW ADFA, demonstrated that "machine-learning online optimization" can discover optimized condensation methods "with less experiments than a competing optimization method and provide insight into which parameters are important in achieving condensation," the physicists explain in an open-access paper in the Nature group journal Scientific Reports. Optical dipole trap used in the experiment, showing the three laser beams and the condensate (red-yellow oval in blue square) (credit: P. B. Wigley et al./Scientific Reports) The team cooled the gas to around 5 microkelvin.
Solve-Select-Scale: A Three Step Process For Sparse Signal Estimation
In the theory of compressed sensing (CS), the sparsity $\|x\|_0$ of the unknown signal $\mathbf{x} \in \mathcal{R}^n$ is of prime importance and the focus of reconstruction algorithms has mainly been either $\|x\|_0$ or its convex relaxation (via $\|x\|_1$). However, it is typically unknown in practice and has remained a challenge when nothing about the size of the support is known. As pointed recently, $\|x\|_0$ might not be the best metric to minimize directly, both due to its inherent complexity as well as its noise performance. Recently a novel stable measure of sparsity $s(\mathbf{x}) := \|\mathbf{x}\|_1^2/\|\mathbf{x}\|_2^2$ has been investigated by Lopes \cite{Lopes2012}, which is a sharp lower bound on $\|\mathbf{x}\|_0$. The estimation procedure for this measure uses only a small number of linear measurements, does not rely on any sparsity assumptions, and requires very little computation. The usage of the quantity $s(\mathbf{x})$ in sparse signal estimation problems has not received much importance yet. We develop the idea of incorporating $s(\mathbf{x})$ into the signal estimation framework. We also provide a three step algorithm to solve problems of the form $\mathbf{Ax=b}$ with no additional assumptions on the original signal $\mathbf{x}$.
Convex Optimization for Linear Query Processing under Approximate Differential Privacy
Yuan, Ganzhao, Yang, Yin, Zhang, Zhenjie, Hao, Zhifeng
Differential privacy enables organizations to collect accurate aggregates over sensitive data with strong, rigorous guarantees on individuals' privacy. Previous work has found that under differential privacy, computing multiple correlated aggregates as a batch, using an appropriate \emph{strategy}, may yield higher accuracy than computing each of them independently. However, finding the best strategy that maximizes result accuracy is non-trivial, as it involves solving a complex constrained optimization program that appears to be non-linear and non-convex. Hence, in the past much effort has been devoted in solving this non-convex optimization program. Existing approaches include various sophisticated heuristics and expensive numerical solutions. None of them, however, guarantees to find the optimal solution of this optimization problem. This paper points out that under ($\epsilon$, $\delta$)-differential privacy, the optimal solution of the above constrained optimization problem in search of a suitable strategy can be found, rather surprisingly, by solving a simple and elegant convex optimization program. Then, we propose an efficient algorithm based on Newton's method, which we prove to always converge to the optimal solution with linear global convergence rate and quadratic local convergence rate. Empirical evaluations demonstrate the accuracy and efficiency of the proposed solution.
Online Optimization for Large-Scale Max-Norm Regularization
Max-norm regularizer has been extensively studied in the last decade as it promotes an effective low-rank estimation for the underlying data. However, such max-norm regularized problems are typically formulated and solved in a batch manner, which prevents it from processing big data due to possible memory budget. In this paper, hence, we propose an online algorithm that is scalable to large-scale setting. Particularly, we consider the matrix decomposition problem as an example, although a simple variant of the algorithm and analysis can be adapted to other important problems such as matrix completion. The crucial technique in our implementation is to reformulating the max-norm to an equivalent matrix factorization form, where the factors consist of a (possibly overcomplete) basis component and a coefficients one. In this way, we may maintain the basis component in the memory and optimize over it and the coefficients for each sample alternatively. Since the memory footprint of the basis component is independent of the sample size, our algorithm is appealing when manipulating a large collection of samples. We prove that the sequence of the solutions (i.e., the basis component) produced by our algorithm converges to a stationary point of the expected loss function asymptotically. Numerical study demonstrates encouraging results for the efficacy and robustness of our algorithm compared to the widely used nuclear norm solvers.
Transfer Hashing with Privileged Information
Zhou, Joey Tianyi, Xu, Xinxing, Pan, Sinno Jialin, Tsang, Ivor W., Qin, Zheng, Goh, Rick Siow Mong
Most existing learning to hash methods assume that there are sufficient data, either labeled or unlabeled, on the domain of interest (i.e., the target domain) for training. However, this assumption cannot be satisfied in some real-world applications. To address this data sparsity issue in hashing, inspired by transfer learning, we propose a new framework named Transfer Hashing with Privileged Information (THPI). Specifically, we extend the standard learning to hash method, Iterative Quantization (ITQ), in a transfer learning manner, namely ITQ+. In ITQ+, a new slack function is learned from auxiliary data to approximate the quantization error in ITQ. We developed an alternating optimization approach to solve the resultant optimization problem for ITQ+. We further extend ITQ+ to LapITQ+ by utilizing the geometry structure among the auxiliary data for learning more precise binary codes in the target domain. Extensive experiments on several benchmark datasets verify the effectiveness of our proposed approaches through comparisons with several state-of-the-art baselines.
Empirical Similarity for Absent Data Generation in Imbalanced Classification
When the training data in a two-class classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similarity-based Imbalanced Classification (SBIC) that learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the training data into account, SBIC utilizes the concept of absent data, i.e. data from the minority class which can help better find the boundary between the two classes. SBIC simultaneously optimizes the weights of the empirical similarity function and finds the locations of absent data points. As such, SBIC uses an embedded mechanism for synthetic data generation which does not modify the training dataset, but alters the algorithm to suit imbalanced datasets. Therefore, SBIC uses the ideas of both major schools of thoughts in imbalanced classification: Like cost-sensitive approaches SBIC operates on an algorithm level to handle imbalanced structures; and similar to synthetic data generation approaches, it utilizes the properties of unobserved data points from the minority class. The application of SBIC to imbalanced datasets suggests it is comparable to, and in some cases outperforms, other commonly used classification techniques for imbalanced datasets.