Optimization
FedFOR: Stateless Heterogeneous Federated Learning with First-Order Regularization
Tian, Junjiao, Smith, James Seale, Kira, Zsolt
Federated Learning (FL) seeks to distribute model training across local clients without collecting data in a centralized data-center, hence removing data-privacy concerns. A major challenge for FL is data heterogeneity (where each client's data distribution can differ) as it can lead to weight divergence among local clients and slow global convergence. The current SOTA FL methods designed for data heterogeneity typically impose regularization to limit the impact of non-IID data and are stateful algorithms, i.e., they maintain local statistics over time. While effective, these approaches can only be used for a special case of FL involving only a small number of reliable clients. For the more typical applications of FL where the number of clients is large (e.g., edge-device and mobile applications), these methods cannot be applied, motivating the need for a stateless approach to heterogeneous FL which can be used for any number of clients. We derive a first-order gradient regularization to penalize inconsistent local updates due to local data heterogeneity. Specifically, to mitigate weight divergence, we introduce a first-order approximation of the global data distribution into local objectives, which intuitively penalizes updates in the opposite direction of the global update. The end result is a stateless FL algorithm that achieves 1) significantly faster convergence (i.e., fewer communication rounds) and 2) higher overall converged performance than SOTA methods under non-IID data distribution. Importantly, our approach does not impose unrealistic limits on the client size, enabling learning from a large number of clients as is typical in most FL applications.
Model Predictive Control for Dynamic Cloth Manipulation: Parameter Learning and Experimental Validation
Luque, Adrià, Parent, David, Colomé, Adrià, Ocampo-Martinez, Carlos, Torras, Carme
Robotic cloth manipulation is a relevant challenging problem for autonomous robotic systems. Highly deformable objects as textile items can adopt multiple configurations and shapes during their manipulation. Hence, robots should not only understand the current cloth configuration but also be able to predict the future possible behaviors of the cloth. This paper addresses the problem of indirectly controlling the configuration of certain points of a textile object, by applying actions on other parts of the object through the use of a Model Predictive Control (MPC) strategy, which also allows to foresee the behavior of indirectly controlled points. The designed controller finds the optimal control signals to attain the desired future target configuration. The explored scenario in this paper considers tracking a reference trajectory with the lower corners of a square piece of cloth by grasping its upper corners. To do so, we propose and validate a linear cloth model that allows solving the MPC-related optimization problem in real time. Reinforcement Learning (RL) techniques are used to learn the optimal parameters of the proposed cloth model and also to tune the resulting MPC. After obtaining accurate tracking results in simulation, the full control scheme was implemented and executed in a real robot, obtaining accurate tracking even in adverse conditions. While total observed errors reach the 5 cm mark, for a 30x30 cm cloth, an analysis shows the MPC contributes less than 30% to that value.
An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization
Hien, Le Thi Khanh, Phan, Duy Nhat, Gillis, Nicolas
In this paper, we introduce TITAN, a novel inerTIal block majorizaTion minimizAtioN framework for non-smooth non-convex optimization problems. To the best of our knowledge, TITAN is the first framework of block-coordinate update method that relies on the majorization-minimization framework while embedding inertial force to each step of the block updates. The inertial force is obtained via an extrapolation operator that subsumes heavy-ball and Nesterov-type accelerations for block proximal gradient methods as special cases. By choosing various surrogate functions, such as proximal, Lipschitz gradient, Bregman, quadratic, and composite surrogate functions, and by varying the extrapolation operator, TITAN produces a rich set of inertial block-coordinate update methods. We study sub-sequential convergence as well as global convergence for the generated sequence of TITAN. We illustrate the effectiveness of TITAN on two important machine learning problems, namely sparse non-negative matrix factorization and matrix completion.
Lower Bounds on the Worst-Case Complexity of Efficient Global Optimization
Xu, Wenjie, Jiang, Yuning, Maddalena, Emilio T., Jones, Colin N.
Efficient global optimization is a widely used method for optimizing expensive black-box functions such as tuning hyperparameter, and designing new material, etc. Despite its popularity, less attention has been paid to analyzing the inherent hardness of the problem although, given its extensive use, it is important to understand the fundamental limits of efficient global optimization algorithms. In this paper, we study the worst-case complexity of the efficient global optimization problem and, in contrast to existing kernel-specific results, we derive a unified lower bound for the complexity of efficient global optimization in terms of the metric entropy of a ball in its corresponding reproducing kernel Hilbert space~(RKHS). Specifically, we show that if there exists a deterministic algorithm that achieves suboptimality gap smaller than $\epsilon$ for any function $f\in S$ in $T$ function evaluations, it is necessary that $T$ is at least $\Omega\left(\frac{\log\mathcal{N}(S(\mathcal{X}), 4\epsilon,\|\cdot\|_\infty)}{\log(\frac{R}{\epsilon})}\right)$, where $\mathcal{N}(\cdot,\cdot,\cdot)$ is the covering number, $S$ is the ball centered at $0$ with radius $R$ in the RKHS and $S(\mathcal{X})$ is the restriction of $S$ over the feasible set $\mathcal{X}$. Moreover, we show that this lower bound nearly matches the upper bound attained by non-adaptive search algorithms for the commonly used squared exponential kernel and the Mat\'ern kernel with a large smoothness parameter $\nu$, up to a replacement of $d/2$ by $d$ and a logarithmic term $\log\frac{R}{\epsilon}$. That is to say, our lower bound is nearly optimal for these kernels.
S-Rocket: Selective Random Convolution Kernels for Time Series Classification
Salehinejad, Hojjat, Wang, Yang, Yu, Yuanhao, Jin, Tang, Valaee, Shahrokh
Random convolution kernel transform (Rocket) is a fast, efficient, and novel approach for time series feature extraction using a large number of independent randomly initialized 1-D convolution kernels of different configurations. The output of the convolution operation on each time series is represented by a partial positive value (PPV). A concatenation of PPVs from all kernels is the input feature vector to a Ridge regression classifier. Unlike typical deep learning models, the kernels are not trained and there is no weighted/trainable connection between kernels or concatenated features and the classifier. Since these kernels are generated randomly, a portion of these kernels may not positively contribute in performance of the model. Hence, selection of the most important kernels and pruning the redundant and less important ones is necessary to reduce computational complexity and accelerate inference of Rocket for applications on the edge devices. Selection of these kernels is a combinatorial optimization problem. In this paper, we propose a scheme for selecting these kernels while maintaining the classification performance. First, the original model is pre-trained at full capacity. Then, a population of binary candidate state vectors is initialized where each element of a vector represents the active/inactive status of a kernel. A population-based optimization algorithm evolves the population in order to find a best state vector which minimizes the number of active kernels while maximizing the accuracy of the classifier. This activation function is a linear combination of the total number of active kernels and the classification accuracy of the pre-trained classifier with the active kernels. Finally, the selected kernels in the best state vector are utilized to train the Ridge regression classifier with the selected kernels.
Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance
Yuan, Zhuoning, Wu, Yuexin, Qiu, Zi-Hao, Du, Xianzhi, Zhang, Lijun, Zhou, Denny, Yang, Tianbao
In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that either rely on a large batch size or a large dictionary of feature vectors. We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point. From the optimization perspective, we explain why existing methods such as SimCLR require a large batch size in order to achieve a satisfactory result. In order to remove such requirement, we propose a memory-efficient Stochastic Optimization algorithm for solving the Global objective of Contrastive Learning of Representations, named SogCLR. We show that its optimization error is negligible under a reasonable condition after a sufficient number of iterations or is diminishing for a slightly different global contrastive objective. Empirically, we demonstrate that SogCLR with small batch size (e.g., 256) can achieve similar performance as SimCLR with large batch size (e.g., 8192) on self-supervised learning task on ImageNet-1K. We also attempt to show that the proposed optimization technique is generic and can be applied to solving other contrastive losses, e.g., two-way contrastive losses for bimodal contrastive learning. The proposed method is implemented in our open-sourced library LibAUC (www.libauc.org).
Learning Acceptance Regions for Many Classes with Anomaly Detection
In multicategory classification, traditional methods return a single class label as the prediction without a confidence measure attached. For points near the classification boundary where the classes overlap, these methods may misclassify with high probability. As classification and machine learning in general have played a more and more significant role in high stake domains, these mistakes can incur severe consequences. To avoid making mistakes when they are likely to happen, set-valued classification methods have emerged (Herbei and Wegkamp, 2006; Shafer and Vovk, 2008; Dümbgen et al., 2008; Denis and Hebiri, 2017; Wang and Qiao, 2018; Zhang et al., 2018; Sadinle et al., 2019). A set-valued classifier may return multiple class labels as the prediction for each observation. Specifically, those near the boundary between classes may receive multiple labels as the prediction. Herbei and Wegkamp (2006), Bartlett and Wegkamp (2008), Ramaswamy et al. (2015) and Zhang et al. (2018) proposed and developed Classification with a Reject Option (CRO) by training a classifier and a rejector at the same time. A rejector determines when to refuse to make a classification for ambiguous points (i.e.
Industrial Data Science for Batch Manufacturing Processes
Arzac-Garmendia, Imanol, Vallerio, Mattia, Perez-Galvan, Carlos, Navarro-Brull, Francisco J.
Batch processes show several sources of variability, from raw materials' properties to initial and evolving conditions that change during the different events in the manufacturing process. In this chapter, we will illustrate with an industrial example how to use machine learning to reduce this apparent excess of data while maintaining the relevant information for process engineers. Two common use cases will be presented: 1) AutoML analysis to quickly find correlations in batch process data, and 2) trajectory analysis to monitor and identify anomalous batches leading to process control improvements.
Signal Decomposition Using Masked Proximal Operators
Meyers, Bennet E., Boyd, Stephen P.
We consider the well-studied problem of decomposing a vector time series signal into components with different characteristics, such as smooth, periodic, nonnegative, or sparse. We describe a simple and general framework in which the components are defined by loss functions (which include constraints), and the signal decomposition is carried out by minimizing the sum of losses of the components (subject to the constraints). When each loss function is the negative log-likelihood of a density for the signal component, this framework coincides with maximum a posteriori probability (MAP) estimation; but it also includes many other interesting cases. Summarizing and clarifying prior results, we give two distributed optimization methods for computing the decomposition, which find the optimal decomposition when the component class loss functions are convex, and are good heuristics when they are not. Both methods require only the masked proximal operator of each of the component loss functions, a generalization of the well-known proximal operator that handles missing entries in its argument. Both methods are distributed, i.e., handle each component separately. We derive tractable methods for evaluating the masked proximal operators of some loss functions that, to our knowledge, have not appeared in the literature.
Hybrid Learning- and Model-Based Planning and Control of In-Hand Manipulation
Zarrin, Rana Soltani, Yamane, Katsu, Jitosho, Rianna
This paper presents a hierarchical framework for planning and control of in-hand manipulation of a rigid object involving grasp changes using fully-actuated multifingered robotic hands. While the framework can be applied to the general dexterous manipulation, we focus on a more complex definition of in-hand manipulation, where at the goal pose the hand has to reach a grasp suitable for using the object as a tool. The high level planner determines the object trajectory as well as the grasp changes, i.e. adding, removing, or sliding fingers, to be executed by the low-level controller. While the grasp sequence is planned online by a learning-based policy to adapt to variations, the trajectory planner and the low-level controller for object tracking and contact force control are exclusively model-based to robustly realize the plan. By infusing the knowledge about the physics of the problem and the low-level controller into the grasp planner, it learns to successfully generate grasps similar to those generated by model-based optimization approaches, obviating the high computation cost of online running of such methods to account for variations. By performing experiments in physics simulation for realistic tool use scenarios, we show the success of our method on different tool-use tasks and dexterous hand models. Additionally, we show that this hybrid method offers more robustness to trajectory and task variations compared to a model-based method.