Optimization
Communication-efficient Algorithm for Distributed Sparse Learning via Two-way Truncation
We propose a communicationally and computationally efficient algorithm for high-dimensional distributed sparse learning. At each iteration, local machines compute the gradient on local data and the master machine solves one shifted $l_1$ regularized minimization problem. The communication cost is reduced from constant times of the dimension number for the state-of-the-art algorithm to constant times of the sparsity number via Two-way Truncation procedure. Theoretically, we prove that the estimation error of the proposed algorithm decreases exponentially and matches that of the centralized method under mild assumptions. Extensive experiments on both simulated data and real data verify that the proposed algorithm is efficient and has performance comparable with the centralized method on solving high-dimensional sparse learning problems.
A Brief Introduction to Machine Learning for Engineers
Department of Informatics, King's College London; osvaldo.simeone@kcl.ac.uk ABSTRACT This monograph aims at providing an introduction to key concepts, algorithms, and theoretical frameworks in machine learning, including supervised and unsupervised learning, statistical learning theory, probabilistic graphical models and approximate inference. The intended readership consists of electrical engineers with a background in probability and linear algebra. The treatment builds on first principles, and organizes the main ideas according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, directed and undirected models, and convex and non-convex optimization. The mathematical framework uses information-theoretic measures as a unifying tool. The text offers simple and reproducible numerical examples providing insights into key motivations and conclusions. Rather than providing exhaustive details on the existing myriad solutions in each specific category, for which the reader is referred to textbooks and papers, this monograph is meant as an entry point for an engineer into the literature on machine learning.
A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds
Joulani, Pooria, György, András, Szepesvári, Csaba
Recently, much work has been done on extending the scope of online learning and incremental stochastic optimization algorithms. In this paper we contribute to this effort in two ways: First, based on a new regret decomposition and a generalization of Bregman divergences, we provide a self-contained, modular analysis of the two workhorses of online learning: (general) adaptive versions of Mirror Descent (MD) and the Follow-the-Regularized-Leader (FTRL) algorithms. The analysis is done with extra care so as not to introduce assumptions not needed in the proofs and allows to combine, in a straightforward way, different algorithmic ideas (e.g., adaptivity, optimism, implicit updates) and learning settings (e.g., strongly convex or composite objectives). This way we are able to reprove, extend and refine a large body of the literature, while keeping the proofs concise. The second contribution is a byproduct of this careful analysis: We present algorithms with improved variational bounds for smooth, composite objectives, including a new family of optimistic MD algorithms with only one projection step per round. Furthermore, we provide a simple extension of adaptive regret bounds to practically relevant non-convex problem settings with essentially no extra effort.
Entropic Determinants
Granziol, Diego, Roberts, Stephen
The ability of many powerful machine learning algorithms to deal with large data sets without compromise is often hampered by computationally expensive linear algebra tasks, of which calculating the log determinant is a canonical example. In this paper we demonstrate the optimality of Maximum Entropy methods in approximating such calculations. We prove the equivalence between mean value constraints and sample expectations in the big data limit, that Covariance matrix eigenvalue distributions can be completely defined by moment information and that the reduction of the self entropy of a maximum entropy proposal distribution, achieved by adding more moments reduces the KL divergence between the proposal and true eigenvalue distribution. We empirically verify our results on a variety of SparseSuite matrices and establish best practices.
Faster independent component analysis by preconditioning with Hessian approximations
Ablin, Pierre, Cardoso, Jean-François, Gramfort, Alexandre
Independent Component Analysis (ICA) is a technique for unsupervised exploration of multi-channel data that is widely used in observational sciences. In its classic form, ICA relies on modeling the data as linear mixtures of non-Gaussian independent sources. The maximization of the corresponding likelihood is a challenging problem if it has to be completed quickly and accurately on large sets of real data. We introduce the Preconditioned ICA for Real Data (Picard) algorithm, which is a relative L-BFGS algorithm preconditioned with sparse Hessian approximations. Extensive numerical comparisons to several algorithms of the same class demonstrate the superior performance of the proposed technique, especially on real data, for which the ICA model does not necessarily hold.
Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis
Jamshidi, Pooyan, Siegmund, Norbert, Velez, Miguel, Kästner, Christian, Patel, Akshay, Agarwal, Yuvraj
Highly configurable software systems, such as mobile apps, compilers, and big data engines, are increasingly exposed to end users and developers on a daily basis for varying use cases. Users are interested not only in the fastest configuration, but also in whether the fastest configuration for their applications also remains the fastest when the environmental situation has been changed. For instance, a mobile developer might be interested to know if the software that she has configured to consume minimal energy on a testing platform will also remain energy efficient on the users' mobile platform; or, in general, whether the configuration will remain optimal when the software is used in a different environment (e.g., with a different workload, on different hardware). Performance models have been extensively used to learn and describe the performance behavior of configurable systems [15], [19], [21], [23], [33], [43]-[45], [54], [61], [63]. However, the exponentially growing configuration space, complex interactions, and unknown constraints among configuration options [56] often make it costly and difficult to learn an accurate and reliable performance model. Even worse, existing techniques usually consider only a fixed environment (e.g., fixed workload, fixed hardware, fixed versions of the dependent libraries); should that environment change, a new performance model may need to be learned from scratch. This strong assumption limits the reusability of performance models across environments.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Salimans, Tim, Ho, Jonathan, Chen, Xi, Sidor, Szymon, Sutskever, Ilya
We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.
Simplified Energy Landscape for Modularity Using Total Variation
Boyd, Zachary, Bae, Egil, Tai, Xue-Cheng, Bertozzi, Andrea L.
Networks capture pairwise interactions between entities and are frequently used in applications such as social networks, food networks, and protein interaction networks, to name a few. Communities, cohesive groups of nodes, often form in these applications, and identifying them gives insight into the overall organization of the network. One common quality function used to identify community structure is modularity. In Hu et al. [SIAM J. App. Math., 73(6), 2013], it was shown that modularity optimization is equivalent to minimizing a particular nonconvex total variation (TV) based functional over a discrete domain. They solve this problem---assuming the number of communities is known---using a Merriman, Bence, Osher (MBO) scheme. We show that modularity optimization is equivalent to minimizing a convex TV-based functional over a discrete domain---again, assuming the number of communities is known. Furthermore, we show that modularity has no convex relaxation satisfying certain natural conditions. Despite this, we partially relax the discrete constraint using a Ginzburg Landau functional, yielding an optimization problem that is more nearly convex. We then derive an MBO algorithm with fewer parameters than in Hu et al. and which is 7 times faster at solving the associated diffusion equation due to the fact that the underlying discretization is unconditionally stable. Our numerical tests include a hyperspectral video whose associated graph has 29 million edges, which is roughly 37 times larger than was handled in the paper of Hu et al.
Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe
Berthet, Quentin, Perchet, Vianney
We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.
Recovery Conditions and Sampling Strategies for Network Lasso
Mara, Alexandru, Jung, Alexander
The network Lasso is a recently proposed convex optimization method for machine learning from massive network structured datasets, i.e., big data over networks. It is a variant of the well-known least absolute shrinkage and selection operator (Lasso), which is underlying many methods in learning and signal processing involving sparse models. Highly scalable implementations of the network Lasso can be obtained by state-of-the art proximal methods, e.g., the alternating direction method of multipliers (ADMM). By generalizing the concept of the compatibility condition put forward by van de Geer and Buehlmann as a powerful tool for the analysis of plain Lasso, we derive a sufficient condition, i.e., the network compatibility condition, on the underlying network topology such that network Lasso accurately learns a clustered underlying graph signal. This network compatibility condition relates the location of the sampled nodes with the clustering structure of the network. In particular, the NCC informs the choice of which nodes to sample, or in machine learning terms, which data points provide most information if labeled.