Learning Graphical Models
Property-driven State-Space Coarsening for Continuous Time Markov Chains
Michaelides, Michalis, Milios, Dimitrios, Hillston, Jane, Sanguinetti, Guido
Dynamical systems with large state-spaces are often expensive to thoroughly explore experimentally. Coarse-graining methods aim to define simpler systems which are more amenable to analysis and exploration; most current methods, however, focus on a priori state aggregation based on similarities in transition rates, which is not necessarily reflected in similar behaviours at the level of trajectories. We propose a way to coarsen the state-space of a system which optimally preserves the satisfaction of a set of logical specifications about the system's trajectories. Our approach is based on Gaussian Process emulation and Multi-Dimensional Scaling, a dimensionality reduction technique which optimally preserves distances in non-Euclidean spaces. We show how to obtain low-dimensional visualisations of the system's state-space from the perspective of properties' satisfaction, and how to define macro-states which behave coherently with respect to the specifications. Our approach is illustrated on a non-trivial running example, showing promising performance and high computational efficiency.
Compact Compositional Models
Learning compact and interpretable representations is a very natural task, which has not been solved satisfactorily even for simple binary datasets. In this paper, we review various ways of composing experts for binary data and argue that competitive forms of interaction are best suited to learn low-dimensional representations. We propose a new composition rule that discourages experts from focusing on similar structures and that penalizes opposing votes strongly so that abstaining from voting becomes more attractive. We also introduce a novel sequential initialization procedure, which is based on a process of oversimplification and correction. Experiments show that with our approach very intuitive models can be learned.
The 10 Algorithms Machine Learning Engineers Need to Know
There is no doubt that the sub-field of machine learning / artificial intelligence has gained popularity over the past couple of years. As Big Data is the hottest trend in the tech industry at the moment, machine learning is an incredibly powerful way to make predictions or calculated suggestions based on large amounts of data. Some of the most common examples of machine learning are Netflix's algorithms, which suggest movies based on movies you have watched in the past, and Amazon's algorithms, which recommend books based on books you have bought before. So if you want to learn more about machine learning, how do you start? For me, my first introduction was an Artificial Intelligence class I took while studying abroad in Copenhagen.
On the Latent Variable Interpretation in Sum-Product Networks
Peharz, Robert, Gens, Robert, Pernkopf, Franz, Domingos, Pedro
One of the central themes in Sum-Product Networks (SPNs) is the interpretation of sum nodes as marginalized latent variables (LVs). This interpretation yields an increased syntactic or semantic structure, allows the application of the EM algorithm, and makes it possible to perform MPE inference efficiently. In the literature, the LV interpretation was justified by explicitly introducing the indicator variables corresponding to the LVs' states. However, as pointed out in this paper, this approach is in conflict with the completeness condition in SPNs and does not fully specify the probabilistic model. We propose a remedy for this problem by modifying the original approach for introducing the LVs, which we call SPN augmentation. We discuss conditional independencies in augmented SPNs, formally establish the probabilistic interpretation of the sum-weights and give an interpretation of augmented SPNs as Bayesian networks. Based on these results, we find a sound derivation of the EM algorithm for SPNs. Furthermore, the Viterbi-style algorithm for MPE proposed in the literature was never proven to be correct. We show that this is indeed a correct algorithm when applied to selective SPNs, and in particular when applied to augmented SPNs. Our theoretical results are confirmed in experiments on synthetic data and 103 real-world datasets.
Fuzzy Bayesian Learning
Abstract: In this paper we propose a novel approach for learning from data using rule based fuzzy inference systems where the model parameters are estimated using Bayesian inference and Markov Chain Monte Carlo (MCMC) techniques. We show the applicability of the method for regression and classification tasks using synthetic data-sets and also a real world example in the financial services industry. Then we demonstrate how the method can be extended for knowledge extraction to select the individual rules in a Bayesian way which best explains the given data. Finally we discuss the advantages and pitfalls of using this method over state-of-the-art techniques and highlight the specific class of problems where this would be useful.
Probability theory and fuzzy logic have been shown to be complementary [1] and various works have looked at the symbiotic integration of these two paradigms [2], [3] including the recently introduced concept of Z-numbers [4]. Historically fuzzy logic has been applied to problems involving imprecision in linguistic variables, while probability theory has been used for quantifying uncertainty in a wide range of disciplines. Various generalisations and extensions of fuzzy sets have been proposed to incorporate uncertainty and vagueness which arise from multiple sources. For example, the type-2 fuzzy sets [5], [6] and type-n fuzzy sets [5] can include uncertainty while defining the membership functions themselves. Intuitionistic fuzzy sets [7] additionally introduce the degree of non-membership of an element to take into account that there might be some hesitation degree and the degree of membership and non-membership of an element might not always add to one. Non-stationary fuzzy sets [8] can model variation of opinion over time by defining a collection of type 1 fuzzy sets and an explicit relationship between them. Fuzzy multi-sets [9] generalise crisp sets where multiple occurrences of an element are permitted.
Hesitant fuzzy sets [10] have been proposed from the motivation that the problem of assigning a degree of membership to an element is not because of a margin of error (like Atanassov's intuitionistic fuzzy sets) or a possibility distribution on possibility values (e.g. Formally these can be viewed as fuzzy multi-sets but with a different interpretation.
Scaling Factorial Hidden Markov Models: Stochastic Variational Inference without Messages
Ng, Yin Cheng, Chilinski, Pawel, Silva, Ricardo
Factorial Hidden Markov Models (FHMMs) are powerful models for sequential data but they do not scale well with long sequences. We propose a scalable inference and learning algorithm for FHMMs that draws on ideas from the stochastic variational inference, neural network and copula literatures. Unlike existing approaches, the proposed algorithm requires no message passing procedure among latent variables and can be distributed to a network of computers to speed up learning. Our experiments corroborate that the proposed algorithm does not introduce further approximation bias compared to the proven structured mean-field algorithm, and achieves better performance with long sequences and large FHMMs.
Theoretical Comparisons of Positive-Unlabeled Learning against Positive-Negative Learning
Niu, Gang, Plessis, Marthinus Christoffel du, Sakai, Tomoya, Ma, Yao, Sugiyama, Masashi
In PU learning, a binary classifier is trained from positive (P) and unlabeled (U) data without negative (N) data. Although N data is missing, PU learning sometimes outperforms PN learning (i.e., ordinary supervised learning). Hitherto, neither theoretical nor experimental analysis has been given to explain this phenomenon. In this paper, we theoretically compare PU (and NU) learning against PN learning based on the upper bounds on estimation errors. We find simple conditions under which PU and NU learning are likely to outperform PN learning, and we prove that, in terms of the upper bounds, either PU or NU learning (depending on the class-prior probability and the sizes of P and N data) given infinite U data will improve on PN learning. Our theoretical findings agree well with the experimental results on artificial and benchmark data even when the experimental setup does not match the theoretical assumptions exactly.
PAC Reinforcement Learning with Rich Observations
Krishnamurthy, Akshay, Agarwal, Alekh, Langford, John
We propose and study a new model for reinforcement learning with rich observations, generalizing contextual bandits to sequential decision making. These models require an agent to take actions based on observations (features) with the goal of achieving long-term performance competitive with a large set of policies. To avoid barriers to sample-efficient learning associated with large observation spaces and general POMDPs, we focus on problems that can be summarized by a small number of hidden states and have long-term rewards that are predictable by a reactive function class. In this setting, we design and analyze a new reinforcement learning algorithm, Least Squares Value Elimination by Exploration. We prove that the algorithm learns near optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space. Our result provides theoretical justification for reinforcement learning with function approximation.
About Feature Scaling and Normalization
The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll have the properties of a standard normal distribution with a mean of 0 and a standard deviation of 1. Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms. Intuitively, we can think of gradient descent as a prominent example (an optimization algorithm often used in logistic regression, SVMs, perceptrons, neural networks etc.); with features being on different scales, certain weights may update faster than others since the feature values play a role in the weight updates. Other intuitive examples include K-Nearest Neighbor algorithms and clustering algorithms that use, for example, Euclidean distance measures. In fact, the only family of algorithms that I could think of being scale-invariant are tree-based methods. Let's take the general CART decision tree algorithm. Without going into much depth regarding information gain and impurity measures, we can think of the decision as "is feature x_i >= some_val?"
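To make the rescaling concrete, here is a minimal sketch of Z-score standardization in plain Python. The feature values, the `standardize` helper and the split threshold are all hypothetical and chosen only for illustration; the population standard deviation is assumed. The last few lines also illustrate why a threshold-style tree split is unaffected by the rescaling: transforming the threshold the same way as the feature yields the same partition of the samples.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores (x - mean) / std, using the population std."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]


# Hypothetical features measured on very different scales.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
incomes = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes)

# After standardization each feature has mean 0 and standard deviation 1.
print(round(fmean(z_heights), 6), round(pstdev(z_heights), 6))
print(round(fmean(z_incomes), 6), round(pstdev(z_incomes), 6))

# A threshold split (hypothetical cut at 165 cm) selects the same samples
# before and after scaling, once the threshold is rescaled the same way.
threshold = 165.0
scaled_threshold = (threshold - fmean(heights_cm)) / pstdev(heights_cm)
raw_split = [x >= threshold for x in heights_cm]
scaled_split = [z >= scaled_threshold for z in z_heights]
print(raw_split == scaled_split)
```

Distance-based methods such as K-Nearest Neighbors would, by contrast, be dominated by the income column if the raw values were used, which is the motivation for scaling given above.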
Deep Learning: Definition, Resources, Comparison with Machine Learning
Deep learning is sometimes referred to as the intersection between machine learning and artificial intelligence. It is about designing algorithms that can make robots intelligent, such as face recognition techniques used in drones to detect and target terrorists, or pattern recognition / computer vision algorithms to automatically pilot a plane, a train, a boat or a car. Many deep learning algorithms (clustering, pattern recognition, automated bidding, recommendation engines, and so on), even though they appear in new contexts such as IoT or machine-to-machine communication, still rely on relatively old-fashioned techniques such as logistic regression, SVM, decision trees, K-NN, naive Bayes, Bayesian modeling, ensembles, random forests, signal processing, filtering, graph theory, game theory, and many others. Some are new, such as indexation algorithms to automate digital publishing, improve search engines, or create and manage large catalogs such as Amazon's product listing. As a result, many deep learning practitioners call themselves data scientists, computer scientists, statisticians, or sometimes engineers.