Goto

Collaborating Authors

 Bayesian Learning


Support Feature Machines

arXiv.org Machine Learning

Support Vector Machines (SVMs) with various kernels have played dominant role in machine learning for many years, finding numerous applications. Although they have many attractive features interpretation of their solutions is quite difficult, the use of a single kernel type may not be appropriate in all areas of the input space, convergence problems for some kernels are not uncommon, the standard quadratic programming solution has $O(m^3)$ time and $O(m^2)$ space complexity for $m$ training patterns. Kernel methods work because they implicitly provide new, useful features. Such features, derived from various kernels and other vector transformations, may be used directly in any machine learning algorithm, facilitating multiresolution, heterogeneous models of data. Therefore Support Feature Machines (SFM) based on linear models in the extended feature spaces, enabling control over selection of support features, give at least as good results as any kernel-based SVMs, removing all problems related to interpretation, scaling and convergence. This is demonstrated for a number of benchmark datasets analyzed with linear discrimination, SVM, decision trees and nearest neighbor methods.


The CM Algorithm for the Maximum Mutual Information Classifications of Unseen Instances

arXiv.org Machine Learning

The Maximum Mutual Information (MMI) criterion is different from the Least Error Rate (LER) criterion. It can reduce failing to report small probability events. This paper introduces the Channels Matching (CM) algorithm for the MMI classifications of unseen instances. It also introduces some semantic information methods, which base the CM algorithm. In the CM algorithm, label learning is to let the semantic channel match the Shannon channel (Matching I) whereas classifying is to let the Shannon channel match the semantic channel (Matching II). We can achieve the MMI classifications by repeating Matching I and II. For low-dimensional feature spaces, we only use parameters to construct n likelihood functions for n different classes (rather than to construct partitioning boundaries as gradient descent) and expresses the boundaries by numerical values. Without searching in parameter spaces, the computation of the CM algorithm for low-dimensional feature spaces is very simple and fast. Using a two-dimensional example, we test the speed and reliability of the CM algorithm by different initial partitions. For most initial partitions, two iterations can make the mutual information surpass 99% of the convergent MMI. The analysis indicates that for high-dimensional feature spaces, we may combine the CM algorithm with neural networks to improve the MMI classifications for faster and more reliable convergence.


Bayesian Learning of Neural Network Architectures

arXiv.org Machine Learning

In this paper we propose a Bayesian method for estimating architectural parameters of neural networks, namely layer size and network depth. We do this by learning concrete distributions over these parameters. Our results show that regular networks with a learnt structure can generalise better on small datasets, while fully stochastic networks can be more robust to parameter initialisation. The proposed method relies on standard neural variational learning and, unlike randomised architecture search, does not require a retraining of the model, thus keeping the computational overhead at minimum.


Markov Properties of Discrete Determinantal Point Processes

arXiv.org Machine Learning

Determinantal point processes (DPPs) are probabilistic models for repulsion. When used to represent the occurrence of random subsets of a finite base set, DPPs allow to model global negative associations in a mathematically elegant and direct way. Discrete DPPs have become popular and computationally tractable models for solving several machine learning tasks that require the selection of diverse objects, and have been successfully applied in numerous real-life problems. Despite their popularity, the statistical properties of such models have not been adequately explored. In this note, we derive the Markov properties of discrete DPPs and show how they can be expressed using graphical models.


Improved Causal Discovery from Longitudinal Data Using a Mixture of DAGs

arXiv.org Machine Learning

Many causal processes in biomedicine contain cycles and evolve. However, most causal discovery algorithms assume that the underlying causal process follows a single directed acyclic graph (DAG) that does not change over time. The algorithms can therefore infer erroneous causal relations with high confidence when run on real biomedical data. In this paper, I relax the single DAG assumption by modeling causal processes using a mixture of DAGs so that the graph can change over time. I then describe a causal discovery algorithm called Causal Inference over Mixtures (CIM) to infer causal structure from a mixture of DAGs using longitudinal data. CIM improves the accuracy of causal discovery on both real and synthetic clinical datasets even when cycles, non-stationarity, non-linearity, latent variables and selection bias exist simultaneously.


Bayesian surrogate learning in dynamic simulator-based regression problems

arXiv.org Machine Learning

The estimation of unknown values of parameters (or hidden variables, control variables) that characterise a physical system often relies on the comparison of measured data with synthetic data produced by some numerical simulator of the system as the parameter values are varied. This process often encounters two major difficulties: the generation of synthetic data for each considered set of parameter values can be computationally expensive if the system model is complicated; and the exploration of the parameter space can be inefficient and/or incomplete, a typical example being when the exploration becomes trapped in a local optimum of the objection function that characterises the mismatch between the measured and synthetic data. A method to address both these issues is presented, whereby: a surrogate model (or proxy), which emulates the computationally expensive system simulator, is constructed using deep recurrent networks (DRN); and a nested sampling (NS) algorithm is employed to perform efficient and robust exploration of the parameter space. The analysis is performed in a Bayesian context, in which the samples characterise the full joint posterior distribution of the parameters, from which parameter estimates and uncertainties are easily derived. The proposed approach is compared with conventional methods in some numerical examples, for which the results demonstrate that one can accelerate the parameter estimation process by at least an order of magnitude.


Graphical-model based estimation and inference for differential privacy

arXiv.org Machine Learning

Many privacy mechanisms reveal high-level information about a data distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms.


Bayes metaclassifier and Soft-confusion-matrix classifier in the task of multi-label classification

arXiv.org Machine Learning

The aim of this paper was to compare soft confusion matrix approach and Bayes metaclassifier under the multi-label classification framework. Although the methods were successfully applied under the multi-label classification framework, they have not been compared directly thus far. Such comparison is of vital importance because both methods are quite similar as they are both based on the concept of randomized reference classifier. Since both algorithms were designed to deal with single-label problems, they are combined with the problem-transformation approach to multi-label classification. Present study included 29 benchmark datasets and four different base classifiers. The algorithms were compared in terms of 11 quality criteria and the results were subjected to statistical analysis.


Ask less - Scale Market Research without Annoying Your Customers

arXiv.org Machine Learning

Abstract--Market research is generally performed by surveying arepresentative sample of customers with questions that includes contexts such as psycho-graphics, demographics, attitude and product preferences. Survey responses are used to segment the customers into various groups that are useful for targeted marketing and communication. Reducing the number of questions asked to the customer has utility for businesses to scale the market research to a large number of customers. We demonstrate the effectiveness of our approach using an example market segmentation of broadband customers. I. INTRODUCTION A key technique for developing successful business strategies inbusiness to customer (B2C) companies is to develop a good understanding of the market and the customer behavior.


Unified estimation framework for unnormalized models with statistical efficiency

arXiv.org Machine Learning

Parameter estimation of unnormalized models is a challenging problem because normalizing constants are not calculated explicitly and maximum likelihood estimation is computationally infeasible. Although some consistent estimators have been proposed earlier, the problem of statistical efficiency does remain. In this study, we propose a unified, statistically efficient estimation framework for unnormalized models and several novel efficient estimators with reasonable computational time regardless of whether the sample space is discrete or continuous. The loss functions of the proposed estimators are derived by combining the following two methods: (1) density-ratio matching using Bregman divergence, and (2) plugging-in nonparametric estimators. We also analyze the properties of the proposed estimators when the unnormalized model is misspecified. Finally, the experimental results demonstrate the advantages of our method over existing approaches.