Optimization
Efficient computation of contrastive explanations
Artelt, André, Hammer, Barbara
With the increasing deployment of machine learning systems in practice, transparency and explainability have become serious issues. Contrastive explanations are considered to be useful and intuitive, in particular when it comes to explaining decisions to lay people, since they mimic the way in which humans explain. Yet, so far, comparably little research has addressed computationally feasible technologies, which allow guarantees on uniqueness and optimality of the explanation and which enable an easy incorporation of additional constraints. Here, we will focus on specific types of models rather than black-box technologies. We study the relation of contrastive and counterfactual explanations and propose mathematical formalizations as well as a 2-phase algorithm for efficiently computing pertinent positives of many standard machine learning models.
SCOBO: Sparsity-Aware Comparison Oracle Based Optimization
Cai, HanQin, Mckenzie, Daniel, Yin, Wotao, Zhang, Zhenliang
We study derivative-free optimization for convex functions where we further assume that function evaluations are unavailable. Instead, one only has access to a comparison oracle, which, given two points $x$ and $y$, and returns a single bit of information indicating which point has larger function value, $f(x)$ or $f(y)$, with some probability of being incorrect. This probability may be constant or it may depend on $|f(x)-f(y)|$. Previous algorithms for this problem have been hampered by a query complexity which is polynomially dependent on the problem dimension, $d$. We propose a novel algorithm that breaks this dependence: it has query complexity only logarithmically dependent on $d$ if the function in addition has low dimensional structure that can be exploited. Numerical experiments on synthetic data and the MuJoCo dataset show that our algorithm outperforms state-of-the-art methods for comparison based optimization, and is even competitive with other derivative-free algorithms that require explicit function evaluations.
Parameter Optimization using high-dimensional Bayesian Optimization
In this thesis, I explore the possibilities of conducting Bayesian optimization techniques in high dimensional domains. Although high dimensional domains can be defined to be between hundreds and thousands of dimensions, we will primarily focus on problem settings that occur between two and 20 dimensions. As such, we focus on solutions to practical problems, such as tuning the parameters for an electron accelerator, or for even simpler tasks that can be run and optimized just in time with a standard laptop at hand. Our main contributions are 1.) comparing how the log-likelihood affects the angle-difference in the real projection matrix, and the found matrix matrix, 2.) an extensive analysis of current popular methods including strengths and shortcomings, 3.) a short analysis on how dimensionality reduction techniques can be used for feature selection, and 4.) a novel algorithm called "BORING", which allows for a simple fallback mechanism if the matrix identification fails, as well as taking into consideration "passive" subspaces which provide small perturbations of the function at hand. The main features of BORING are 1.) the possibility to identify the subspace (unlike most other optimization algorithms), and 2.) to provide a much lower penalty to identify the subspace if identification fails, as optimization is still the primary goal.
Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method
Thekumparampil, Kiran Koshy, Jain, Prateek, Netrapalli, Praneeth, Oh, Sewoong
We consider the classical setting of optimizing a nonsmooth Lipschitz continuous convex function over a convex constraint set, when having access to a (stochastic) first-order oracle (FO) for the function and a projection oracle (PO) for the constraint set. It is well known that to achieve $\epsilon$-suboptimality in high-dimensions, $\Theta(\epsilon^{-2})$ FO calls are necessary. This is achieved by the projected subgradient method (PGD). However, PGD also entails $O(\epsilon^{-2})$ PO calls, which may be computationally costlier than FO calls (e.g. nuclear norm constraints). Improving this PO calls complexity of PGD is largely unexplored, despite the fundamental nature of this problem and extensive literature. We present first such improvement. This only requires a mild assumption that the objective function, when extended to a slightly larger neighborhood of the constraint set, still remains Lipschitz and accessible via FO. In particular, we introduce MOPES method, which carefully combines Moreau-Yosida smoothing and accelerated first-order schemes. This is guaranteed to find a feasible $\epsilon$-suboptimal solution using only $O(\epsilon^{-1})$ PO calls and optimal $O(\epsilon^{-2})$ FO calls. Further, instead of a PO if we only have a linear minimization oracle (LMO, a la Frank-Wolfe) to access the constraint set, an extension of our method, MOLES, finds a feasible $\epsilon$-suboptimal solution using $O(\epsilon^{-2})$ LMO calls and FO calls---both match known lower bounds, resolving a question left open since White (1993). Our experiments confirm that these methods achieve significant speedups over the state-of-the-art, for a problem with costly PO and LMO calls.
Auxiliary Learning by Implicit Differentiation
Navon, Aviv, Achituve, Idan, Maron, Haggai, Chechik, Gal, Fetaya, Ethan
Training with multiple auxiliary tasks is a common practice used in deep learning for improving the performance on the main task of interest. Two main challenges arise in this multi-task learning setting: (i) Designing useful auxiliary tasks; and (ii) Combining auxiliary tasks into a single coherent loss. We propose a novel framework, AuxiLearn, that targets both challenges, based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn non-linear interactions between auxiliary tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes. We find that AuxiLearn consistently improves accuracy compared with competing methods.
Descending through a Crowded Valley -- Benchmarking Deep Learning Optimizers
Schmidt, Robin M., Schneider, Frank, Hennig, Philipp
Choosing the optimizer is considered to be among the most crucial design decisions in deep learning, and it is not an easy one. The growing literature now lists hundreds of optimization methods. In the absence of clear theoretical guidance and conclusive empirical evidence, the decision is often made based on anecdotes. In this work, we aim to replace these anecdotes, if not with a conclusive ranking, then at least with evidence-backed heuristics. To do so, we perform an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers while giving a concise overview of the wide range of possible choices. Analyzing almost 35,000 individual runs, we contribute the following three points: (i) Optimizer performance varies greatly across tasks. This subset includes popular favorites and some lesser-known contenders. We have open-sourced all our experimental results, making them directly available as challenging and well-tuned baselines. This allows for more meaningful comparisons when evaluating novel optimization methods without requiring any further computational efforts. Large-scale stochastic optimization drives a wide variety of machine learning tasks. Because choosing the right optimization algorithm and effectively tuning its hyperparameters heavily influences the training speed and final performance of the learned model, doing so is an important, everyday challenge to practitioners. Hence, stochastic optimization methods have been a focal point of research (cf. Figure 1), engendering an ever-growing list of algorithms, many of them specifically targeted towards deep learning.
Linear Last-iterate Convergence in Constrained Saddle-point Optimization
Wei, Chen-Yu, Lee, Chung-Wei, Zhang, Mengxiao, Luo, Haipeng
Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative Weights Update (OMWU) for saddle-point optimization have received growing attention due to their favorable last-iterate convergence. However, their behaviors for simple bilinear games over the probability simplex are still not fully understood -- previous analysis lacks explicit convergence rates, only applies to an exponentially small learning rate, or requires additional assumptions such as the uniqueness of the optimal solution. In this work, we significantly expand the understanding of last-iterate convergence for OGDA and OMWU in the constrained setting. Specifically, for OMWU in bilinear games over the simplex, we show that when the equilibrium is unique, linear last-iterate convergence is achievable with a constant learning rate, which improves the result of (Daskalakis & Panageas, 2019) under the same assumption. We then significantly extend the results to more general objectives and feasible sets for the projected OGDA algorithm, by introducing a sufficient condition under which OGDA exhibits concrete last-iterate convergence rates with a constant learning rate. We show that bilinear games over any polytope satisfy this condition and OGDA converges exponentially fast even without the unique equilibrium assumption. Our condition also holds for strongly-convex-strongly-concave functions, recovering the result of (Hsieh et al., 2019). Finally, we provide experimental results to further support our theory.
Artificial Intelligence: Research Impact on Key Industries; the Upper-Rhine Artificial Intelligence Symposium (UR-AI 2020)
The TriRhenaTech alliance presents a collection of accepted papers of the cancelled tri-national 'Upper-Rhine Artificial Inteeligence Symposium' planned for 13th May 2020 in Karlsruhe. The TriRhenaTech alliance is a network of universities in the Upper-Rhine Trinational Metropolitan Region comprising of the German universities of applied sciences in Furtwangen, Kaiserslautern, Karlsruhe, and Offenburg, the Baden-Wuerttemberg Cooperative State University Loerrach, the French university network Alsace Tech (comprised of 14 'grandes \'ecoles' in the fields of engineering, architecture and management) and the University of Applied Sciences and Arts Northwestern Switzerland. The alliance's common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students.
PareCO: Pareto-aware Channel Optimization for Slimmable Neural Networks
Chin, Ting-Wu, Morcos, Ari S., Marculescu, Diana
Slimmable neural networks provide a flexible trade-off front between prediction error and computational cost (such as the number of floating-point operations or FLOPs) with the same storage cost as a single model. They have been proposed recently for resource-constrained settings such as mobile devices. However, current slimmable neural networks use a single width-multiplier for all the layers to arrive at sub-networks with different performance profiles, which neglects that different layers affect the network's prediction accuracy differently and have different FLOP requirements. Hence, developing a principled approach for deciding width-multipliers across different layers could potentially improve the performance of slimmable networks. To allow for heterogeneous width-multipliers across different layers, we formulate the problem of optimizing slimmable networks from a multi-objective optimization lens, which leads to a novel algorithm for optimizing both the shared weights and the width-multipliers for the sub-networks. We perform extensive empirical analysis with 14 network and dataset combinations and find that less over-parameterized networks benefit more from a joint channel and weight optimization than extremely over-parameterized networks. Quantitatively, improvements up to 1.7% and 1% in top-1 accuracy on the ImageNet dataset can be attained for MobileNetV2 and MobileNetV3, respectively. Our results highlight the potential of optimizing the channel counts for different layers jointly with the weights for slimmable networks.