Optimization
CoNES: Convex Natural Evolutionary Strategies
Veer, Sushant, Majumdar, Anirudha
We present a novel algorithm -- convex natural evolutionary strategies (CoNES) -- for optimizing high-dimensional blackbox functions by leveraging tools from convex optimization and information geometry. CoNES is formulated as an efficiently-solvable convex program that adapts the evolutionary strategies (ES) gradient estimate to promote rapid convergence. The resulting algorithm is invariant to the parameterization of the belief distribution. Our numerical results demonstrate that CoNES vastly outperforms conventional blackbox optimization methods on a suite of functions used for benchmarking blackbox optimizers. Furthermore, CoNES demonstrates the ability to converge faster than conventional blackbox methods on a selection of OpenAI's MuJoCo reinforcement learning tasks for locomotion.
$p$-Norm Flow Diffusion for Local Graph Clustering
Fountoulakis, Kimon, Wang, Di, Yang, Shenghao
Local graph clustering and the closely related seed set expansion problem are primitives on graphs that are central to a wide range of analytic and learning tasks such as local clustering, community detection, nodes ranking and feature inference. Prior work on local graph clustering mostly falls into two categories with numerical and combinatorial roots respectively. In this work, we draw inspiration from both fields and propose a family of convex optimization formulations based on the idea of diffusion with p-norm network flow for $p\in (1,\infty)$. In the context of local clustering, we characterize the optimal solutions for these optimization problems and show their usefulness in finding low conductance cuts around input seed set. In particular, we achieve quadratic approximation of conductance in the case of $p=2$ similar to the Cheeger-type bounds of spectral methods, constant factor approximation when $p\rightarrow\infty$ similar to max-flow based methods, and a smooth transition for general $p$ values in between. Thus, our optimization formulation can be viewed as bridging the numerical and combinatorial approaches, and we can achieve the best of both worlds in terms of speed and noise robustness. We show that the proposed problem can be solved in strongly local running time for $p\ge 2$ and conduct empirical evaluations on both synthetic and real-world graphs to illustrate our approach compares favorably with existing methods.
A Novel Column Generation Heuristic for Airline Crew Pairing Optimization with Large-scale Complex Flight Networks
Aggarwal, Divyam, Saxena, Dhish Kumar, Bรคck, Thomas, Emmerich, Michael
For an airline, the crew operating cost is second only to the fuel cost, making the crew pairing optimization (CPO) critical for business viability. Its aim is to generate a set of flight sequences (crew pairings) that cover all flights in an airline's schedule, at minimum cost, while satisfying several legality constraints. Being an NP-hard combinatorial optimization problem, CPO is tackled by relaxing the underlying Integer Programming Problem into a Linear Programming Problem, and solving the latter through Column generation (CG) technique. However, with the expansion of airlines' operations lately, the curse of dimensionality renders the exact CG-implementations obsolete, paving the way for heuristic-based CG-implementations. Yet, the much prevalent large-scale complex flight networks involving multiple-crew bases and hub-and-spoke sub-networks, largely remain unaddressed. To bridge the research gap, this paper proposes a novel CG heuristic, which has enabled in-house development of an Airline Crew Pairing Optimizer (AirCROP). The efficacy of the heuristic/AirCROP has been: (a) tested on real-world airline data with an unprecedented conjunct scale-and-complexity, marked by over 4200 flights, 15 crew bases, and over a billion pairings, and (b) validated by the research consortium's industrial sponsor. This paper has a dedicated focus on the proposed CG heuristic which constitutes the core search mechanism of the optimizer, by balancing random exploration (of pairings' space), exploitation of domain knowledge (on optimal solution's features), and utilization of the past computational effort through archiving. Though this paper has an airline context, the underlying propositions may find applications across different domains as the proposed CG heuristic can serve as a template on how to utilize domain knowledge to better tackle large-scale combinatorial optimization problems.
A Flexible Optimization Framework for Regularized Matrix-Tensor Factorizations with Linear Couplings
Schenker, Carla, Cohen, Jeremy E., Acar, Evrim
In many areas of science, various sensing technologies are used to obtain information about a single system of interest. Often, none of the datasets alone contains a complete view of the system, but the data measured from different modalities can complement each other. For instance, brain activity patterns can be captured using both electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) signals, which have complementary temporal and spatial resolutions. Similarly, in metabolomics, multiple analytical techniques such as LCMS (Liquid Chromatography - Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) spectroscopy are used to measure chemical compounds in biological samples, providing a more complete picture of underlying biological processes. Joint analysis of datasets from multiple sources, also referred to as data fusion (or multi-modal data mining), exploits these complementary measurements, and allows for better interpretability and, potentially, more accurate recovery of patterns characterizing the underlying phenomena. Nevertheless, data fusion poses many challenges, and there is an emerging need for data fusion methods that can take into account different characteristics of data from multiple sources in many disciplines [1-4]. Data from multiple sources can often be represented in the form of matrices and higher-order tensors. Coupled matrix and tensor factorizations (CMTF) are an effective approach for joint analysis of such datasets in many domains including social network analysis [5-8], neuroscience [9-13], and chemometrics [2, 14]. In such coupled factorizations, each dataset is modelled by a low-rank approximation.
On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning
Ramรญrez, Guillem, Dangovski, Rumen, Nakov, Preslav, Soljaฤiฤ, Marin
The emergence of unsupervised word embeddings, pre-trained on very large monolingual text corpora, is at the core of the ongoing neural revolution in Natural Language Processing (NLP). Initially introduced for English, such pre-trained word embeddings quickly emerged for a number of other languages. Subsequently, there have been a number of attempts to align the embedding spaces across languages, which could enable a number of cross-language NLP applications. Performing the alignment using unsupervised cross-lingual learning (UCL) is especially attractive as it requires little data and often rivals supervised and semi-supervised approaches. Here, we analyze popular methods for UCL and we find that often their objectives are, intrinsically, versions of the Wasserstein-Procrustes problem. Hence, we devise an approach to solve Wasserstein-Procrustes in a direct way, which can be used to refine and to improve popular UCL methods such as iterative closest point (ICP), multilingual unsupervised and supervised embeddings (MUSE) and supervised Procrustes methods. Our evaluation experiments on standard datasets show sizable improvements over these approaches. We believe that our rethinking of the Wasserstein-Procrustes problem could enable further research, thus helping to develop better algorithms for aligning word embeddings across languages. Our code and instructions to reproduce the experiments are available at https://github.com/guillemram97/wp-hungarian.
Distributed Learning via Filtered Hyperinterpolation on Manifolds
Montรบfar, Guido, Wang, Yu Guang
Learning mappings of data on manifolds is an important topic in contemporary machine learning, with applications in astrophysics, geophysics, statistical physics, medical diagnosis, biochemistry, 3D object analysis. This paper studies the problem of learning real-valued functions on manifolds through filtered hyperinterpolation of input-output data pairs where the inputs may be sampled deterministically or at random and the outputs may be clean or noisy. Motivated by the problem of handling large data sets, it presents a parallel data processing approach which distributes the data-fitting task among multiple servers and synthesizes the fitted sub-models into a global estimator. We prove quantitative relations between the approximation quality of the learned function over the entire manifold, the type of target function, the number of servers, and the number and type of available samples. We obtain the approximation rates of convergence for distributed and non-distributed approaches. For the non-distributed case, the approximation order is optimal.
From Symmetry to Geometry: Tractable Nonconvex Problems
Zhang, Yuqian, Qu, Qing, Wright, John
As science and engineering have become increasingly data-driven, the role of optimization has expanded to touch almost every stage of the data analysis pipeline, from the signal and data acquisition to modeling and prediction. The optimization problems encountered in practice are often nonconvex. While challenges vary from problem to problem, one common source of nonconvexity is nonlinearity in the data or measurement model. Nonlinear models often exhibit symmetries, creating complicated, nonconvex objective landscapes, with multiple equivalent solutions. Nevertheless, simple methods (e.g., gradient descent) often perform surprisingly well in practice. The goal of this survey is to highlight a class of tractable nonconvex problems, which can be understood through the lens of symmetries. These problems exhibit a characteristic geometric structure: local minimizers are symmetric copies of a single ``ground truth'' solution, while other critical points occur at balanced superpositions of symmetric copies of the ground truth, and exhibit negative curvature in directions that break the symmetry. This structure enables efficient methods to obtain global minimizers. We discuss examples of this phenomenon arising from a wide range of problems in imaging, signal processing, and data analysis. We highlight the key role of symmetry in shaping the objective landscape and discuss the different roles of rotational and discrete symmetries. This area is rich with observed phenomena and open problems; we close by highlighting directions for future research.
Cumulant GAN
Pantazis, Yannis, Paul, Dipjyoti, Fasoulakis, Michail, Stylianou, Yannis, Katsoulakis, Markos
In this paper, we propose a novel loss function for training Generative Adversarial Networks (GANs) aiming towards deeper theoretical understanding as well as improved performance for the underlying optimization problem. The new loss function is based on cumulant generating functions giving rise to \emph{Cumulant GAN}. Relying on a recently-derived variational formula, we show that the corresponding optimization problem is equivalent to R{\'e}nyi divergence minimization, thus offering a (partially) unified perspective of GAN losses: the R{\'e}nyi family encompasses Kullback-Leibler divergence (KLD), reverse KLD, Hellinger distance and $\chi^2$-divergence. Wasserstein GAN is also a member of the proposed cumulant GAN. In terms of stability, we rigorously prove the exponential convergence of cumulant GAN to the Nash equilibrium for a linear discriminator, Gaussian distributions and the standard gradient descent algorithm. Finally, we experimentally demonstrate that image generation is generally more robust relative to Wasserstein GAN and it is substantially improved in terms of inception score when weaker discriminators are considered.
One Size Fits All: Can We Train One Denoiser for All Noise Levels?
Gnansambandam, Abhiram, Chan, Stanley H.
When training an estimator such as a neural network for tasks like image denoising, it is often preferred to train one estimator and apply it to all noise levels. The de facto training protocol to achieve this goal is to train the estimator with noisy samples whose noise levels are uniformly distributed across the range of interest. However, why should we allocate the samples uniformly? Can we have more training samples that are less noisy, and fewer samples that are more noisy? What is the optimal distribution? How do we obtain such a distribution? The goal of this paper is to address this training sample distribution problem from a minimax risk optimization perspective. We derive a dual ascent algorithm to determine the optimal sampling distribution of which the convergence is guaranteed as long as the set of admissible estimators is closed and convex. For estimators with non-convex admissible sets such as deep neural networks, our dual formulation converges to a solution of the convex relaxation. We discuss how the algorithm can be implemented in practice. We evaluate the algorithm on linear estimators and deep networks.
Comparator-adaptive Convex Bandits
van der Hoeven, Dirk, Cutkosky, Ashok, Luo, Haipeng
We study bandit convex optimization methods that adapt to the norm of the comparator, a topic that has only been studied before for its full-information counterpart. Specifically, we develop convex bandit algorithms with regret bounds that are small whenever the norm of the comparator is small. We first use techniques from the full-information setting to develop comparator-adaptive algorithms for linear bandits. Then, we extend the ideas to convex bandits with Lipschitz or smooth loss functions, using a new single-point gradient estimator and carefully designed surrogate losses.