 Optimization


RelaySum for Decentralized Deep Learning on Heterogeneous Data

arXiv.org Machine Learning

In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at http://github.com/epfml/relaysgd.
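For comparison, the gossip averaging baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the RelaySum mechanism itself: the chain topology, the Metropolis-Hastings weights, and the function names are assumptions made for the example. It shows that repeated neighbor averaging only approaches the uniform average over many rounds.

```python
def chain_mixing_matrix(n):
    """Metropolis-Hastings mixing weights for a chain of n >= 2 workers (assumed topology)."""
    degree = [1 if i in (0, n - 1) else 2 for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w = 1.0 / (1 + max(degree[i], degree[i + 1]))
        W[i][i + 1] = w
        W[i + 1][i] = w
    for i in range(n):
        # Self-weight makes every row sum to one.
        W[i][i] = 1.0 - sum(W[i])
    return W


def gossip_step(W, x):
    """One averaging round: each worker mixes its value with its neighbors' values."""
    n = len(x)
    return [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]


def gossip(W, x, steps):
    """Apply `steps` rounds of gossip averaging to the worker values x."""
    for _ in range(steps):
        x = gossip_step(W, x)
    return x
```

Starting from `[0, 0, 0, 0, 10]` on five workers, a single round leaves the values far apart, while many rounds bring every worker close to the mean. The sum of the values is preserved at every step because the matrix is doubly stochastic.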


A Regularized Wasserstein Framework for Graph Kernels

arXiv.org Machine Learning

We propose a learning framework for graph kernels, which is theoretically grounded on regularizing optimal transport. This framework provides a novel optimal transport distance metric, namely Regularized Wasserstein (RW) discrepancy, which can preserve both features and structure of graphs via Wasserstein distances on features and their local variations, local barycenters and global connectivity. Two strongly convex regularization terms are introduced to improve the learning ability. One is to relax an optimal alignment between graphs to be a cluster-to-cluster mapping between their locally connected vertices, thereby preserving the local clustering structure of graphs. The other is to take into account node degree distributions in order to better preserve the global structure of graphs. We also design an efficient algorithm to enable a fast approximation for solving the optimization problem. Theoretically, our framework is robust and can guarantee the convergence and numerical stability in optimization. We have empirically validated our method using 12 datasets against 16 state-of-the-art baselines. The experimental results show that our method consistently outperforms all state-of-the-art methods on all benchmark databases for both graphs with discrete attributes and graphs with continuous attributes.


Scaling Bayesian Optimization With Game Theory

arXiv.org Machine Learning

We introduce Bayesian Optimization (BO) with Fictitious Play (BOFiP), an algorithm for the optimization of high-dimensional black-box functions. BOFiP decomposes the original high-dimensional space into several sub-spaces defined by non-overlapping sets of dimensions. These sets are randomly generated at the start of the algorithm, and they form a partition of the dimensions of the original space. BOFiP searches the original space by alternating between BO within sub-spaces and information exchange among sub-spaces, which updates the sub-space function evaluation. The basic idea is to distribute the high-dimensional optimization across low-dimensional sub-spaces, where each sub-space is a player in an equal-interest game. At each iteration, BO produces approximate best replies that update the players' belief distributions. The belief update and BO alternate until a stopping condition is met. High-dimensional problems are common in real applications, and several contributions in the BO literature have highlighted the difficulty of scaling to high dimensions due to the computational complexity associated with estimating the model hyperparameters. This complexity is exponential in the problem dimension, resulting in a substantial loss of performance for most techniques as the input dimensionality increases. We compare BOFiP to several state-of-the-art approaches in the field of high-dimensional black-box optimization. The numerical experiments show the performance over three benchmark objective functions with 20 up to 1000 dimensions. A neural network architecture design problem is tested with 42 up to 911 nodes in 6 up to 92 layers, respectively, resulting in networks with 500 up to 10,000 weights. These experiments empirically show that BOFiP outperforms its competitors, with consistent performance across different problems and increasing problem dimensionality.
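The random partition of dimensions described in the abstract can be sketched as below. This covers only the decomposition step, not the BO or fictitious-play updates; the function name, the round-robin split into near-equal groups, and the `seed` parameter are assumptions for illustration, since the abstract does not say how group sizes are chosen.

```python
import random


def random_partition(n_dims, n_subspaces, seed=None):
    """Split dimension indices 0..n_dims-1 into non-overlapping random groups."""
    rng = random.Random(seed)
    dims = list(range(n_dims))
    rng.shuffle(dims)
    # Deal the shuffled indices round-robin so group sizes differ by at most one.
    return [sorted(dims[k::n_subspaces]) for k in range(n_subspaces)]
```

Each returned group would play the role of one sub-space (one player); together the groups cover every dimension exactly once.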


$\bar{G}_{mst}$: An Unbiased Stratified Statistic and a Fast Gradient Optimization Algorithm Based on It

arXiv.org Machine Learning

It is difficult to optimize a very large model with deep and wide layers. As with most optimization algorithms, training a deep model with gradient methods (SGD-like algorithms) has drawbacks: it easily gets stuck in local minima or saddle points, and it converges slowly. There has been a great deal of research on improving gradient methods, and a considerable part of it focuses on refining the search direction while keeping the per-iteration cost as low as possible, in order to accelerate convergence [10, 11, 12, 13, 14, 15, 16]. These improvements to the search direction fall roughly into two categories. One is the momentum method [11], which is based on principles of physics, together with its improved variants [12, 20, 21]. The momentum method avoids excessive oscillation of the search trajectory by retaining part of the potential energy of the previous trajectory, which accelerates convergence.
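As a rough illustration of the momentum idea, the sketch below implements the classical heavy-ball update on a small ill-conditioned quadratic. It is not the algorithm proposed in the paper; the step size, the momentum coefficient, and the test function are assumptions chosen for the example.

```python
def momentum_descent(grad, x0, lr=0.03, beta=0.9, steps=500):
    """Heavy-ball gradient descent: the velocity keeps a fraction beta of the previous step."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        # New velocity = retained previous velocity minus the scaled gradient.
        v = [beta * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


def quadratic_grad(x):
    """Gradient of the assumed test function f(x) = 0.5 * (x[0]**2 + 25 * x[1]**2)."""
    return [x[0], 25.0 * x[1]]
```

Calling `momentum_descent(quadratic_grad, [5.0, 5.0])` drives both coordinates toward the minimizer at the origin. The retained velocity damps the back-and-forth motion along the steep coordinate.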


Using Traceless Genetic Programming for Solving Multiobjective Optimization Problems

arXiv.org Artificial Intelligence

Traceless Genetic Programming (TGP) is a Genetic Programming (GP) variant used in cases where the focus is on the output of the program rather than on the program itself. The main difference between TGP and other GP techniques is that TGP does not explicitly store the evolved computer programs. Two genetic operators are used in conjunction with TGP: crossover and insertion. In this paper, we focus on how to apply TGP to multi-objective optimization problems, which are quite unusual for GP. Each TGP individual stores the output of a computer program (tree) representing a point in the search space. Numerical experiments show that TGP solves the considered test problems very quickly and very well.


Transform2Act: Learning a Transform-and-Control Policy for Efficient Agent Design

arXiv.org Artificial Intelligence

An agent's functionality is largely determined by its design, i.e., skeletal structure and joint attributes (e.g., length, size, strength). However, finding the optimal agent design for a given function is extremely challenging since the problem is inherently combinatorial and the design space is prohibitively large. Additionally, it can be costly to evaluate each candidate design which requires solving for its optimal controller. To tackle these problems, our key idea is to incorporate the design procedure of an agent into its decision-making process. Specifically, we learn a conditional policy that, in an episode, first applies a sequence of transform actions to modify an agent's skeletal structure and joint attributes, and then applies control actions under the new design. To handle a variable number of joints across designs, we use a graph-based policy where each graph node represents a joint and uses message passing with its neighbors to output joint-specific actions. Using policy gradient methods, our approach enables first-order optimization of agent design and control as well as experience sharing across different designs, which improves sample efficiency tremendously. Experiments show that our approach, Transform2Act, outperforms prior methods significantly in terms of convergence speed and final performance. Notably, Transform2Act can automatically discover plausible designs similar to giraffes, squids, and spiders. Our project website is at https://sites.google.com/view/transform2act.


Self-Evolutionary Optimization for Pareto Front Learning

arXiv.org Artificial Intelligence

Multi-task learning (MTL), which aims to improve performance by learning multiple tasks simultaneously, inherently presents an optimization challenge due to multiple objectives. Hence, multi-objective optimization (MOO) approaches have been proposed for multitasking problems. Recent MOO methods approximate multiple optimal solutions (Pareto front) with a single unified model, an approach collectively referred to as Pareto front learning (PFL). In this paper, we show that PFL can be re-formulated into another MOO problem with multiple objectives, each of which corresponds to different preference weights for the tasks. We leverage an evolutionary algorithm (EA) to propose a method for PFL called self-evolutionary optimization (SEO), which directly maximizes the hypervolume. By using SEO, the neural network learns to approximate the Pareto front conditioned on multiple hyper-parameters that drastically affect the hypervolume. Then, by generating a population of approximations simply by running inference with the network, the hyper-parameters of the network can be optimized by EA. Utilizing SEO for PFL, we also introduce self-evolutionary Pareto networks (SEPNet), enabling the unified model to approximate the entire Pareto front set that maximizes the hypervolume. Extensive experimental results confirm that SEPNet can find a better Pareto front than the current state-of-the-art methods while minimizing the increase in model size and training cost.
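The hypervolume that SEO maximizes can be illustrated for the two-objective case. The sketch below assumes both objectives are minimized (as with task losses) and that a reference point bounds the region from above; the function name and this 2D restriction are assumptions for the example, not the paper's implementation.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by reference point `ref` (both objectives minimized)."""
    # Keep only points that are strictly better than the reference in both objectives.
    inside = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in inside:
        # Sorted by f1, a point adds area only if it improves on the best f2 seen so far.
        if f2 < best_f2:
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area
```

A dominated point contributes nothing to the area, so a larger hypervolume indicates a front that is closer to the ideal point, more spread out, or both.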


Towards Federated Learning-Enabled Visible Light Communication in 6G Systems

arXiv.org Artificial Intelligence

Visible light communication (VLC) technology was introduced as a key enabler for the next generation of wireless networks, mainly thanks to its simple and low-cost implementation. However, several challenges prevent VLC from realizing its full potential, namely limited modulation bandwidth, ambient light interference, optical diffuse reflection effects, device non-linearity, and random receiver orientation. Meanwhile, centralized machine learning (ML) techniques have demonstrated significant potential in handling different challenges relating to wireless communication systems. Specifically, it was shown that ML algorithms exhibit superior capabilities in handling complicated network tasks, such as channel equalization, estimation and modeling, resource allocation, and opportunistic spectrum access control, to name a few. Nevertheless, concerns pertaining to privacy and communication overhead when sharing raw data of the involved clients with a server constitute major bottlenecks in the implementation of centralized ML techniques. This has motivated the emergence of a new distributed ML paradigm, namely federated learning (FL), which can reduce the cost associated with transferring raw data and preserve privacy by training ML models locally and collaboratively at the clients' side. Hence, it becomes evident that integrating FL into VLC networks can provide ubiquitous and reliable implementation of VLC systems. With this motivation, this is the first in-depth review in the literature on the application of FL in VLC networks. To that end, besides the different architectures and related characteristics of FL, we provide a thorough overview of the main design aspects of FL-based VLC systems. Finally, we also highlight some potential future research directions of FL that are envisioned to substantially enhance the performance and robustness of VLC systems.


A Stochastic Newton Algorithm for Distributed Convex Optimization

arXiv.org Machine Learning

Stochastic optimization methods that leverage parallelism have proven immensely useful in modern optimization problems. Recent advances in machine learning have highlighted their importance as these techniques now rely on millions of parameters and increasingly large training sets. While there are many possible ways of parallelizing optimization algorithms, we consider the intermittent communication setting (Zinkevich et al., 2010; Cotter et al., 2011; Dekel et al., 2012; Shamir et al., 2014; Woodworth et al., 2018, 2021), where M parallel machines work together to optimize an objective during R rounds of communication, and where during each round each machine may perform some basic operation (e.g., access the objective by invoking some oracle) K times, and then communicate with all other machines. An important example of this setting is when this basic operation gives independent, unbiased stochastic estimates of the gradient, in which case this setting includes algorithms like Local SGD (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2019; Woodworth et al., 2020a), Minibatch SGD (Dekel et al., 2012), Minibatch AC-SA (Ghadimi and Lan, 2012), and many others. We are motivated by the observation of Woodworth et al. (2020a) that for quadratic objectives, first-order methods such as one-shot averaging (Zinkevich et al., 2010; Zhang et al., 2013)--a special case of Local SGD with a single round of communication--can optimize the objective to a very high degree of accuracy. This prompts trying to reduce the task of optimizing general convex objectives to a short sequence of quadratic problems. Indeed, this is precisely the idea behind many second-order algorithms including Newton's method
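The intermittent communication setting with M machines, R rounds, and K local steps can be sketched with Local SGD on a one-dimensional quadratic. This illustrates the setting only, not the stochastic Newton algorithm of the paper; the objective, the Gaussian noise model, the step size, and the function name are assumptions for the example.

```python
import random


def local_sgd(M=8, R=10, K=20, lr=0.1, target=3.0, noise=1.0, seed=0):
    """Local SGD on the assumed objective f(x) = 0.5 * (x - target)**2 with noisy gradients."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(R):                  # R rounds of communication
        local_iterates = []
        for _ in range(M):              # M machines, simulated one after another
            xm = x
            for _ in range(K):          # K stochastic gradient steps per round
                g = (xm - target) + rng.gauss(0.0, noise)
                xm -= lr * g
            local_iterates.append(xm)
        # Communication step: every machine restarts from the average iterate.
        x = sum(local_iterates) / M
    return x
```

Setting `R=1` gives one-shot averaging, the special case mentioned in the abstract: each machine runs all of its steps independently and the results are averaged once at the end.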


Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning

arXiv.org Machine Learning

The solution of multistage stochastic linear problems (MSLP) represents a challenge for many applications. Long-term hydrothermal dispatch planning (LHDP) materializes this challenge in a real-world problem that affects electricity markets, economies, and natural resources worldwide. No closed-form solutions are available for MSLP, and the definition of non-anticipative policies with high-quality out-of-sample performance is crucial. Linear decision rules (LDR) provide an interesting simulation-based framework for finding high-quality policies for MSLP through two-stage stochastic models. In practical applications, however, the number of parameters to be estimated when using an LDR may be close to or higher than the number of scenarios, thereby generating an in-sample overfit and poor performance in out-of-sample simulations. In this paper, we propose a novel regularization scheme for LDR based on the AdaLASSO (adaptive least absolute shrinkage and selection operator). The goal is to use the parsimony principle, as largely studied in high-dimensional linear regression models, to obtain better out-of-sample performance for an LDR applied to MSLP. Computational experiments show that the overfit threat is non-negligible when using the classical non-regularized LDR to solve MSLP. For the LHDP problem, our analysis highlights the following benefits of the proposed framework in comparison to the non-regularized benchmark: 1) significant reductions in the number of non-zero coefficients (model parsimony), 2) substantial cost reductions in out-of-sample evaluations, and 3) improved spot-price profiles.
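For orientation, the adaptive LASSO penalty in its standard linear-regression form is written below. This is the textbook regression version, given as an assumption about the general shape of the regularizer; the paper's LDR formulation may differ. The symbols $y$, $X$, $\beta$, $\lambda$, $\gamma$, and the initial estimate $\tilde{\beta}$ are notation introduced here.

```latex
\hat{\beta} = \arg\min_{\beta} \; \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} w_j \, |\beta_j|,
\qquad
w_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}}, \quad \gamma > 0
```

Coefficients with a small initial estimate receive a large weight $w_j$ and are pushed to zero more strongly, which is what yields the parsimony mentioned above.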