Are we Forgetting about Compositional Optimisers in Bayesian Optimisation?

arXiv.org Machine Learning

Bayesian optimisation presents a sample-efficient methodology for global optimisation. Within this framework, a crucial performance-determining subroutine is the maximisation of the acquisition function, a task complicated by the fact that acquisition functions tend to be non-convex and thus nontrivial to optimise. In this paper, we undertake a comprehensive empirical study of approaches to maximise the acquisition function. Additionally, by deriving novel, yet mathematically equivalent, compositional forms for popular acquisition functions, we recast the maximisation task as a compositional optimisation problem, allowing us to benefit from the extensive literature in this field. We highlight the empirical advantages of the compositional approach to acquisition function maximisation across 3958 individual experiments comprising synthetic optimisation tasks as well as tasks from Bayesmark. Given the generality of the acquisition function maximisation subroutine, we posit that the adoption of compositional optimisers has the potential to yield performance improvements across all domains in which Bayesian optimisation is currently being applied.


Convergence of block coordinate descent with diminishing radius for nonconvex optimization

arXiv.org Machine Learning

Block coordinate descent (BCD), also known as nonlinear Gauss-Seidel, is a simple iterative algorithm for nonconvex optimization that sequentially minimizes the objective function in each block coordinate while the other coordinates are held fixed. It is known that block-wise convexity of the objective is not enough to guarantee convergence of BCD to the stationary points and some additional regularity condition is needed. In this work, we provide a simple modification of BCD that has guaranteed global convergence to the stationary points for block-wise convex objective functions without additional conditions. Our idea is to restrict the parameter search within a diminishing radius to promote stability of iterates, and then to show that such an auxiliary constraint vanishes in the limit. As an application, we provide a modified alternating least squares algorithm for nonnegative CP tensor factorization that is guaranteed to converge to the stationary points of the reconstruction error function. We also provide some experimental validation of our result.
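
The diminishing-radius idea described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's algorithm or experiments: the two-variable objective `(x*y - 1)^2 + lam*(x^2 + y^2)`, the radius schedule `r_n = 1/n`, and all function names are hypothetical choices made here. The objective is convex in each coordinate separately, so each block update has a closed form, and clipping that minimiser to the interval of radius `r_n` around the previous iterate solves the radius-constrained one-dimensional subproblem.

```python
import numpy as np


def objective(x, y, lam=0.1):
    # Toy objective: convex in x for fixed y and in y for fixed x,
    # but not jointly convex.
    return (x * y - 1.0) ** 2 + lam * (x ** 2 + y ** 2)


def block_min(other, lam=0.1):
    # Exact minimiser over one coordinate with the other held fixed,
    # obtained by setting the partial derivative to zero.
    return other / (other ** 2 + lam)


def bcd_diminishing_radius(x0, y0, n_iters=200, lam=0.1):
    x, y = float(x0), float(y0)
    for n in range(1, n_iters + 1):
        r = 1.0 / n  # assumed radius schedule, shrinking to zero
        # For a 1-D convex subproblem, the constrained minimiser is the
        # unconstrained one clipped to [previous - r, previous + r].
        x = float(np.clip(block_min(y, lam), x - r, x + r))
        y = float(np.clip(block_min(x, lam), y - r, y + r))
    return x, y


if __name__ == "__main__":
    x, y = bcd_diminishing_radius(2.0, 0.5)
    print(x, y, objective(x, y))
```

On this toy problem the radius constraint is active at most in the first few iterations and stops binding as the iterates settle, which mirrors the abstract's remark that the auxiliary constraint vanishes in the limit.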


A similarity-based Bayesian mixture-of-experts model

arXiv.org Machine Learning

We present a new nonparametric mixture-of-experts model for multivariate regression problems, inspired by the probabilistic $k$-nearest neighbors algorithm. Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point, yielding predictive distributions represented by Gaussian mixtures. Posterior inference is performed on the parameters of the mixture components as well as the distance metric using a mean-field variational Bayes algorithm accompanied by a stochastic gradient-based optimization procedure. The proposed method is especially advantageous in settings where inputs are of relatively high dimension in comparison to the data size, where input--output relationships are complex, and where predictive distributions may be skewed or multimodal. Computational studies on two synthetic datasets and one dataset comprising dose statistics of radiation therapy treatment plans show that our mixture-of-experts method outperforms a Gaussian process benchmark model both in terms of validation metrics and visual inspection.


VisEvol: Visual Analytics to Support Hyperparameter Search through Evolutionary Optimization

arXiv.org Machine Learning

During the training phase of machine learning (ML) models, it is usually necessary to configure several hyperparameters. This process is computationally intensive and requires an extensive search to infer the best hyperparameter set for the given problem. The challenge is exacerbated by the fact that most ML models are complex internally, and training involves trial-and-error processes that could remarkably affect the predictive result. Moreover, each hyperparameter of an ML algorithm is potentially intertwined with the others, and changing it might result in unforeseeable impacts on the remaining hyperparameters. Evolutionary optimization is a promising method to try and address those issues. According to this method, performant models are stored, while the remainder are improved through crossover and mutation processes inspired by genetic algorithms. We present VisEvol, a visual analytics tool that supports interactive exploration of hyperparameters and intervention in this evolutionary procedure. In summary, our proposed tool helps the user to generate new models through evolution and eventually explore powerful hyperparameter combinations in diverse regions of the extensive hyperparameter space. The outcome is a voting ensemble (with equal rights) that boosts the final predictive performance. The utility and applicability of VisEvol are demonstrated with two use cases and interviews with ML experts who evaluated the effectiveness of the tool.


DNA mixture deconvolution using an evolutionary algorithm with multiple populations, hill-climbing, and guided mutation

arXiv.org Machine Learning

DNA samples from crime cases analysed in forensic genetics frequently contain DNA from multiple contributors. These occur as convolutions of the DNA profiles of the individual contributors to the DNA sample. Thus, in cases where one or more of the contributors were unknown, an objective of interest would be the separation, often called deconvolution, of these unknown profiles. In order to obtain deconvolutions of the unknown DNA profiles, we introduced a multiple population evolutionary algorithm (MEA). Because the fitness is based on a probabilistic model, we allowed the mutation operator of the MEA to exploit this, guiding it by the deviations between the observed and the expected value for every element of the encoded individual. This guided mutation operator (GM) was designed such that the larger the deviation, the higher the probability of mutation. Furthermore, the GM was inhomogeneous in time, decreasing to a specified lower bound as the number of iterations increased. We analysed 102 two-person DNA mixture samples in varying mixture proportions. The samples were quantified using two different DNA prep. kits: (1) Illumina ForenSeq Panel B (30 samples), and (2) Applied Biosystems Precision ID Globalfiler NGS STR panel (72 samples). The DNA mixtures were deconvoluted by the MEA and compared to the true DNA profiles of the sample. We analysed three scenarios where we assumed: (1) the DNA profile of the major contributor was unknown, (2) the DNA profile of the minor contributor was unknown, and (3) both DNA profiles were unknown. Furthermore, we conducted a series of sensitivity experiments on the ForenSeq panel by varying the sub-population size, comparing a completely random homogeneous mutation operator to the guided operator with varying mutation decay rates, and allowing for hill-climbing of the parent population.
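
One way to read the guided mutation operator is sketched below. The abstract gives no formula, so everything here is an assumption made for illustration: the function name, the normalisation by the largest deviation, the geometric decay `decay ** iteration`, and the default parameter values are all hypothetical. The sketch only captures the two stated properties, namely that larger deviations get higher mutation probability and that the overall rate decreases over iterations down to a lower bound.

```python
import numpy as np


def guided_mutation_probabilities(observed, expected, iteration,
                                  decay=0.9, lower_bound=0.05):
    # Absolute deviation between observed and expected values,
    # one entry per element of the encoded individual.
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    deviation = np.abs(observed - expected)

    # Assumed normalisation: scale deviations into [0, 1] so the element
    # with the largest deviation is the most likely to mutate.
    largest = deviation.max()
    relative = deviation / largest if largest > 0 else np.zeros_like(deviation)

    # Assumed time inhomogeneity: a geometric decay of the overall
    # mutation rate, floored at a specified lower bound.
    scale = max(lower_bound, decay ** iteration)
    return scale * relative


if __name__ == "__main__":
    probs = guided_mutation_probabilities([10, 5, 0], [8, 5, 4], iteration=3)
    rng = np.random.default_rng(0)
    mutate_mask = rng.random(probs.shape) < probs  # elements chosen to mutate
    print(probs, mutate_mask)
```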


A Hypergradient Approach to Robust Regression without Correspondence

arXiv.org Machine Learning

We consider a regression problem, where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real-world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most existing methods are only applicable when the sample size is small, and are limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework, ROBOT, for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate the regression without correspondence as a continuous optimization problem. Then by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.


Explaining by Removing: A Unified Framework for Model Explanation

arXiv.org Machine Learning

Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We establish a new class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature's influence. Our framework unifies 25 existing methods, including several of the most widely used approaches (SHAP, LIME, Meaningful Perturbations, permutation tests). This new class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.
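
The three dimensions of the framework can be made concrete with one simple combination of choices. The sketch below is an illustration under assumptions, not a method taken from the paper: it removes a feature by replacing it with a baseline value, explains the model's prediction for a single input, and summarises each feature's influence as the change in output when that feature alone is removed. The function name and the linear toy model are hypothetical.

```python
import numpy as np


def removal_attributions(model, x, baseline):
    # Dimension 1 (feature removal): replace the feature with a baseline value.
    # Dimension 2 (model behaviour): the prediction for the single input x.
    # Dimension 3 (summary): output change when one feature is removed.
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    full_prediction = model(x)
    scores = np.empty_like(x)
    for i in range(x.size):
        x_removed = x.copy()
        x_removed[i] = baseline[i]
        scores[i] = full_prediction - model(x_removed)
    return scores


if __name__ == "__main__":
    weights = np.array([2.0, -1.0, 0.5])

    def linear_model(features):
        # Toy model used only for illustration.
        return float(weights @ features)

    print(removal_attributions(linear_model, [1.0, 2.0, 3.0], np.zeros(3)))
```

For a linear model this reduces to `weights[i] * (x[i] - baseline[i])`, which makes the sketch easy to sanity-check; swapping any of the three choices (for example, averaging over a dataset instead of using a fixed baseline) gives a different member of the same family.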


Towards Metaheuristics "In the Large"

arXiv.org Artificial Intelligence

Following decades of sustained improvement, metaheuristics are one of the great success stories of optimization research. However, in order for research in metaheuristics to avoid fragmentation and a lack of reproducibility, there is a pressing need for stronger scientific and computational infrastructure to support the development, analysis and comparison of new approaches. We argue that, via principled choice of infrastructure support, the field can pursue a higher level of scientific enquiry. We describe our vision and report on progress, showing how the adoption of common protocols for all metaheuristics can help liberate the potential of the field, easing the exploration of the design space of metaheuristics.


Federated Composite Optimization

arXiv.org Machine Learning

Federated Learning (FL) is a distributed learning paradigm which scales on-device learning collaboratively and privately. Standard FL algorithms such as Federated Averaging (FedAvg) are primarily geared towards smooth unconstrained settings. In this paper, we study the Federated Composite Optimization (FCO) problem, where the objective function in FL includes an additive (possibly) non-smooth component. Such optimization problems are fundamental to machine learning and arise naturally in the context of regularization (e.g., sparsity, low-rank, monotonicity, and constraint). To tackle this problem, we propose different primal/dual averaging approaches and study their communication and computation complexities. Of particular interest is Federated Dual Averaging (FedDualAvg), a federated variant of the dual averaging algorithm. FedDualAvg uses a novel double averaging procedure, which involves the gradient averaging step of standard dual averaging and an average of client updates akin to standard federated averaging. Our theoretical analysis and empirical experiments demonstrate that FedDualAvg outperforms baselines for FCO.


A Survey on the Explainability of Supervised Machine Learning

arXiv.org Machine Learning

Predictions obtained by, e.g., artificial neural networks have a high accuracy, but humans often perceive the models as black boxes. Insights about the decision making are mostly opaque for humans. Understanding the decision making is of paramount importance, particularly in highly sensitive areas such as healthcare or finance. The decision making behind the black boxes needs to be more transparent, accountable, and understandable for humans. This survey paper provides essential definitions and an overview of the different principles and methodologies of explainable Supervised Machine Learning (SML). We conduct a state-of-the-art survey that reviews past and recent explainable SML approaches and classifies them according to the introduced definitions. Finally, we illustrate principles by means of an explanatory case study and discuss important future directions.