Goto

Collaborating Authors

 Optimization


Towards Interpretable Multi-Task Learning Using Bilevel Programming

arXiv.org Machine Learning

Interpretable Multi-Task Learning can be expressed as learning a sparse graph of the task relationship based on the prediction performance of the learned models. Since many natural phenomenon exhibit sparse structures, enforcing sparsity on learned models reveals the underlying task relationship. Moreover, different sparsification degrees from a fully connected graph uncover various types of structures, like cliques, trees, lines, clusters or fully disconnected graphs. In this paper, we propose a bilevel formulation of multi-task learning that induces sparse graphs, thus, revealing the underlying task relationships, and an efficient method for its computation. We show empirically how the induced sparse graph improves the interpretability of the learned models and their relationship on synthetic and real data, without sacrificing generalization performance.


Bayesian Screening: Multi-test Bayesian Optimization Applied to in silico Material Screening

arXiv.org Machine Learning

We present new multi-test Bayesian optimization models and algorithms for use in large scale material screening applications. Our screening problems are designed around two tests, one expensive and one cheap. This paper differs from other recent work on multi-test Bayesian optimization through use of a flexible model that allows for complex, non-linear relationships between the cheap and expensive test scores. This additional modeling flexibility is essential in the material screening applications which we describe. We demonstrate the power of our new algorithms on a family of synthetic toy problems as well as on real data from two large scale screening studies.


Disentangling Neural Architectures and Weights: A Case Study in Supervised Classification

arXiv.org Machine Learning

The history of deep learning has shown that human-designed problem-specific networks can greatly improve the classification performance of general neural models. In most practical cases, however, choosing the optimal architecture for a given task remains a challenging problem. Recent architecture-search methods are able to automatically build neural models with strong performance but fail to fully appreciate the interaction between neural architecture and weights. This work investigates the problem of disentangling the role of the neural structure and its edge weights, by showing that well-trained architectures may not need any link-specific fine-tuning of the weights. We compare the performance of such weight-free networks (in our case these are binary networks with {0, 1}-valued weights) with random, weight-agnostic, pruned and standard fully connected networks. To find the optimal weight-agnostic network, we use a novel and computationally efficient method that translates the hard architecture-search problem into a feasible optimization problem.More specifically, we look at the optimal task-specific architectures as the optimal configuration of binary networks with {0, 1}-valued weights, which can be found through an approximate gradient descent strategy. Theoretical convergence guarantees of the proposed algorithm are obtained by bounding the error in the gradient approximation and its practical performance is evaluated on two real-world data sets. For measuring the structural similarities between different architectures, we use a novel spectral approach that allows us to underline the intrinsic differences between real-valued networks and weight-free architectures.


Simplify Calculus for Machine Learning with SymPy

#artificialintelligence

Machine learning requires some calculus. Many of the online Machine learning courses don't always cover the basics of calculus assuming the user already has a foundation. If you are anything like me, you might need a bit of a refresher. Let's take a look at a few basic calculus concepts and how to write them in your code using SymPy. Most of the time, we need calculus to find the derivatives in optimization problems.


A Causal Linear Model to Quantify Edge Unfairness for Unfair Edge Prioritization and Discrimination Removal

arXiv.org Artificial Intelligence

The dataset can be generated by an unfair mechanism in numerous settings. For instance, a judicial system is unfair if it rejects the bail plea of an accused based on the race. To mitigate the unfairness in the procedure generating the dataset, we need to identify the sources of unfairness, quantify the unfairness in these sources, quantify how these sources affect the overall unfairness, and prioritize the sources before addressing the real-world issues underlying them. Prior work of (Zhang, et. al, 2017) identifies and removes discrimination after data is generated but does not suggest a methodology to mitigate unfairness in the data generation phase. We use the notion of an unfair edge, same as (Chiappa, et. al, 2018), to be the source of discrimination and quantify unfairness along an unfair edge. We also quantify overall unfairness in a particular decision towards a subset of sensitive attributes in terms of edge unfairness and measure the sensitivity of the former when the latter is varied. Using the formulation of cumulative unfairness in terms of edge unfairness, we alter the discrimination removal methodology discussed in (Zhang, et. al, 2017) by not formulating it as an optimization problem. This helps in getting rid of constraints that grow exponentially in the number of sensitive attributes and values taken by them. Finally, we discuss a priority algorithm for policymakers to address the real-world issues underlying the edges that result in unfairness. The experimental section validates the linear model assumption made to quantify edge unfairness.


Momentum-based Gradient Methods in Multi-objective Recommender Systems

arXiv.org Artificial Intelligence

Multi-objective gradient methods are becoming the standard for solving multi-objective problems. Among others, they show promising results in developing multi-objective recommender systems with both correlated and uncorrelated objectives. Classic multi-gradient descent usually relies on the combination of the gradients, not including the computation of first and second moments of the gradients. This leads to a brittle behavior and misses important areas in the solution space. In this work, we create a multi-objective Adamize method that leverage the benefits of the Adam optimizer in single-objective problems. This corrects and stabilizes the gradients of every objective before calculating a common gradient descent vector that optimizes all the objectives simultaneously. We evaluate the benefits of Multi-objective Adamize on two multi-objective recommender systems and for three different objective combinations, both correlated or uncorrelated. We report significant improvements, measured with three different Pareto front metrics: hypervolume, coverage, and spacing. Finally, we show that the Adamized Pareto front strictly dominates the previous one on multiple objective pairs.


A Markov Decision Process Approach to Active Meta Learning

arXiv.org Machine Learning

In supervised learning, we fit a single statistical model to a given data set, assuming that the data is associated with a singular task, which yields well-tuned models for specific use, but does not adapt well to new contexts. By contrast, in meta-learning, the data is associated with numerous tasks, and we seek a model that may perform well on all tasks simultaneously, in pursuit of greater generalization. One challenge in meta-learning is how to exploit relationships between tasks and classes, which is overlooked by commonly used random or cyclic passes through data. In this work, we propose actively selecting samples on which to train by discerning covariates inside and between meta-training sets. Specifically, we cast the problem of selecting a sample from a number of meta-training sets as either a multi-armed bandit or a Markov Decision Process (MDP), depending on how one encapsulates correlation across tasks. We develop scheduling schemes based on Upper Confidence Bound (UCB), Gittins Index and tabular Markov Decision Problems (MDPs) solved with linear programming, where the reward is the scaled statistical accuracy to ensure it is a time-invariant function of state and action. Across a variety of experimental contexts, we observe significant reductions in sample complexity of active selection scheme relative to cyclic or i.i.d. sampling, demonstrating the merit of exploiting covariates in practice.


Kernel-based parameter estimation of dynamical systems with unknown observation functions

arXiv.org Machine Learning

A low-dimensional dynamical system is observed in an experiment as a high-dimensional signal; For example, a video of a chaotic pendulums system. Assuming that we know the dynamical model up to some unknown parameters, can we estimate the underlying system's parameters by measuring its time-evolution only once? The key information for performing this estimation lies in the temporal inter-dependencies between the signal and the model. We propose a kernel-based score to compare these dependencies. Our score generalizes a maximum likelihood estimator for a linear model to a general nonlinear setting in an unknown feature space. We estimate the system's underlying parameters by maximizing the proposed score. We demonstrate the accuracy and efficiency of the method using two chaotic dynamical systems - the double pendulum and the Lorenz '63 model.


Narrative Maps: An Algorithmic Approach to Represent and Extract Information Narratives

arXiv.org Artificial Intelligence

Narratives are fundamental to our perception of the world and are pervasive in all activities that involve the representation of events in time. Yet, modern online information systems do not incorporate narratives in their representation of events occurring over time. This article aims to bridge this gap, combining the theory of narrative representations with the data from modern online systems. We make three key contributions: a theory-driven computational representation of narratives, a novel extraction algorithm to obtain these representations from data, and an evaluation of our approach. In particular, given the effectiveness of visual metaphors, we employ a route map metaphor to design a narrative map representation. The narrative map representation illustrates the events and stories in the narrative as a series of landmarks and routes on the map. Each element of our representation is backed by a corresponding element from formal narrative theory, thus providing a solid theoretical background to our method. Our approach extracts the underlying graph structure of the narrative map using a novel optimization technique focused on maximizing coherence while respecting structural and coverage constraints. We showcase the effectiveness of our approach by performing a user evaluation to assess the quality of the representation, metaphor, and visualization. Evaluation results indicate that the Narrative Map representation is a powerful method to communicate complex narratives to individuals. Our findings have implications for intelligence analysts, computational journalists, and misinformation researchers.


The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation

arXiv.org Machine Learning

Comparing metric measure spaces (i.e. a metric space endowed with a probability distribution) is at the heart of many machine learning problems. This includes for instance predicting properties of molecules in quantum chemistry or generating graphs with varying connectivity. The most popular distance between such metric measure spaces is the Gromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. This distance has been successfully applied to supervised learning and generative modeling, for applications as diverse as quantum chemistry or natural language processing. The GW distance is however limited to the comparison of metric measure spaces endowed with a \emph{probability} distribution. This strong limitation is problematic for many applications in ML where there is no a priori natural normalization on the total mass of the data. Furthermore, imposing an exact conservation of mass across spaces is not robust to outliers and often leads to irregular matching. To alleviate these issues, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more computationally tractable upper-bounding relaxation. They both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries.