 Undirected Networks


Delayed acceptance ABC-SMC

arXiv.org Machine Learning

Approximate Bayesian computation (ABC) is now an established technique for statistical inference used in cases where the likelihood function is computationally expensive or not available. It relies on the use of a model that is specified in the form of a simulator, and approximates the likelihood at a parameter $\theta$ by simulating auxiliary data sets $x$ and evaluating the distance of $x$ from the true data $y$. However, ABC is not computationally feasible in cases where using the simulator for each $\theta$ is very expensive. This paper investigates this situation in cases where a cheap, but approximate, simulator is available. The approach is to employ delayed acceptance Markov chain Monte Carlo (MCMC) within an ABC sequential Monte Carlo (SMC) sampler in order to, in a first stage of the kernel, use the cheap simulator to rule out parts of the parameter space that are not worth exploring, so that the "true" simulator is only run (in the second stage of the kernel) where there is a reasonable chance of accepting proposed values of $\theta$. We show that this approach can be used quite automatically, with the only tuning parameter choice additional to ABC-SMC being the number of particles we wish to carry through to the second stage of the kernel. Applications to stochastic differential equation models and latent doubly intractable distributions are presented.
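The two-stage idea can be illustrated with a minimal sketch. This is not the paper's kernel: it assumes a one-dimensional parameter, a uniform prior, a symmetric random-walk proposal, uniform ABC kernels with tolerances `eps_cheap` and `eps`, and a starting value that already satisfies both tolerances. Under those assumptions the Metropolis-Hastings ratio reduces to a product of indicators, so the chain targets a posterior that also conditions on the cheap simulation passing its tolerance. The simulators `cheap_sim` and `expensive_sim` and all numerical settings are hypothetical.

```python
import random


def da_abc_mcmc(y_obs, cheap_sim, expensive_sim, n_iters, theta0,
                eps_cheap, eps, step, prior_lo, prior_hi, seed=0):
    """Two-stage ABC-MCMC sketch: screen with the cheap simulator first."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    n_expensive = 0  # number of calls to the expensive simulator
    for _ in range(n_iters):
        proposal = theta + rng.gauss(0.0, step)
        accepted = False
        # Stage 1: the proposal must lie in the prior support and the
        # cheap simulation must land within eps_cheap of the data.
        if prior_lo <= proposal <= prior_hi:
            if abs(cheap_sim(proposal, rng) - y_obs) < eps_cheap:
                # Stage 2: only now pay for the expensive simulator.
                n_expensive += 1
                if abs(expensive_sim(proposal, rng) - y_obs) < eps:
                    accepted = True
        if accepted:
            theta = proposal
        chain.append(theta)
    return chain, n_expensive


def cheap_sim(theta, rng):
    # Hypothetical cheap simulator: a single noisy draw.
    return rng.gauss(theta, 1.0)


def expensive_sim(theta, rng):
    # Hypothetical expensive simulator: the mean of 50 noisy draws.
    return sum(rng.gauss(theta, 1.0) for _ in range(50)) / 50


chain, n_expensive = da_abc_mcmc(
    y_obs=2.0, cheap_sim=cheap_sim, expensive_sim=expensive_sim,
    n_iters=2000, theta0=2.0, eps_cheap=1.0, eps=0.3, step=1.0,
    prior_lo=-10.0, prior_hi=10.0)
print(f"expensive simulator calls: {n_expensive} of {len(chain)} proposals")
```

The count of expensive calls is what the first stage is meant to reduce: every proposal rejected by the cheap screen costs no run of the expensive simulator.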


Estimating speech from lip dynamics

arXiv.org Machine Learning

The goal of this project is to develop a limited lip reading algorithm for a subset of the English language. We consider a scenario in which no audio information is available. The raw video is processed and the position of the lips in each frame is extracted. We then prepare the lip data for processing and classify the lips into visemes and phonemes. Hidden Markov Models are used to predict the words the speaker is saying based on the sequences of classified phonemes and visemes. The GRID audiovisual sentence corpus [10][11] is used for our study.


Latent tree models

arXiv.org Machine Learning

Latent tree models are graphical models defined on trees, in which only a subset of variables is observed. They were first discussed by Judea Pearl as tree-decomposable distributions to generalise star-decomposable distributions such as the latent class model. Latent tree models, or their submodels, are widely used in phylogenetic analysis, network tomography, computer vision, causal modeling, and data clustering. They also contain other well-known classes of models such as hidden Markov models, the Brownian motion tree model, the Ising model on a tree, and many popular models used in phylogenetics. We offer here a concise introduction to the theory of latent tree models. We emphasise the role of tree metrics in the structural description of this model class, in designing learning algorithms, and in understanding fundamental limits of what can be learned and when. We present Gaussian and general Markov models as subclasses of latent tree models that admit tractable and rigorous analysis.

A leaf of $T$ is a vertex of degree one, an internal vertex is a vertex which is not a leaf, and an inner edge is an edge whose ends are both internal vertices. Given a tree $T$, define a rooted tree as a directed graph obtained from $T$ by picking one of its vertices $r$ and directing all edges away from $r$. The vertex $r$ is called the root. Trees will always be leaf-labeled with the labelling set $\{1,\dots,m\}$, where $m$ is the number of leaves. An undirected tree is trivalent if each internal vertex has degree precisely three. A rooted tree is a binary rooted tree if each internal vertex has precisely two children. In many applications rooted trees are depicted without using arrows, where direction is made implicit by drawing the root on the top and the leaves on the bottom; see Figure 1(c). Two special types of undirected trees are a star tree, with one internal vertex, and a trivalent tree on four leaves, called a quartet tree; see Figure 1(a) and (b). A forest is a collection of trees. Forests here are also leaf-labeled with the labelling set $\{1,\dots,m\}$, which means that each tree in this collection is leaf-labeled and the corresponding collection of labelling sets forms a set partition of $\{1,\dots,m\}$. We define three graph operations on trees (forests). Removing an edge means removing that edge from the edge set. Contracting an edge between $u$ and $v$ means removing $u$ and $v$ from the vertex set, and adding a new vertex $w$ and edges such that $w$ is adjacent to all vertices which were adjacent to $u$ or $v$. Suppressing a vertex of degree two means removing that vertex and replacing the two edges incident to that vertex by a single edge.

Figure 1: (a) An undirected star tree with five leaves, (b) a quartet tree, (c) a binary rooted tree.
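The three graph operations defined above can be sketched on an adjacency-set representation of an undirected tree. The representation, the function names, and the vertex labels `"a"`, `"b"`, `"w"` are illustrative assumptions, not notation from the source.

```python
def remove_edge(adj, u, v):
    """Return a copy of the graph with the edge between u and v removed."""
    new = {x: set(nb) for x, nb in adj.items()}
    new[u].discard(v)
    new[v].discard(u)
    return new


def contract_edge(adj, u, v, w):
    """Replace u and v by a new vertex w adjacent to all their former neighbours."""
    neighbours = (adj[u] | adj[v]) - {u, v}
    new = {x: set(nb) - {u, v} for x, nb in adj.items() if x not in (u, v)}
    new[w] = set(neighbours)
    for x in neighbours:
        new[x].add(w)
    return new


def suppress_vertex(adj, x):
    """Remove a degree-two vertex x and join its two neighbours by one edge."""
    if len(adj[x]) != 2:
        raise ValueError("only vertices of degree two can be suppressed")
    a, b = adj[x]
    new = {k: set(nb) for k, nb in adj.items() if k != x}
    new[a].discard(x)
    new[b].discard(x)
    new[a].add(b)
    new[b].add(a)
    return new


# Quartet tree: leaves 1, 2 hang off internal vertex "a"; leaves 3, 4 off "b".
quartet = {
    1: {"a"}, 2: {"a"}, 3: {"b"}, 4: {"b"},
    "a": {1, 2, "b"}, "b": {3, 4, "a"},
}

# Contracting the inner edge turns the quartet tree into a star tree.
star = contract_edge(quartet, "a", "b", "w")

# Removing a leaf edge leaves "a" with degree two, so it can be suppressed.
pruned = suppress_vertex(remove_edge(quartet, 1, "a"), "a")
```

In this example, `star` has a single internal vertex `"w"` joined to all four leaves, while `pruned` is a forest consisting of the isolated leaf 1 and a star tree on leaves 2, 3 and 4.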


Phase Diagram of Restricted Boltzmann Machines and Generalised Hopfield Networks with Arbitrary Priors

arXiv.org Machine Learning

Restricted Boltzmann Machines are described by the Gibbs measure of a bipartite spin glass, which in turn corresponds to the one of a generalised Hopfield network. This equivalence allows us to characterise the state of these systems in terms of retrieval capabilities, both at low and high load. We study the paramagnetic-spin glass and the spin glass-retrieval phase transitions, as the pattern (i.e. weight) distribution and spin (i.e. unit) priors vary smoothly from Gaussian real variables to Boolean discrete variables. Our analysis shows that the presence of a retrieval phase is robust and not peculiar to the standard Hopfield model with Boolean patterns. The retrieval region is larger when the pattern entries and retrieval units get more peaked and, conversely, when the hidden units acquire a broader prior and therefore have a stronger response to high fields. Moreover, at low load retrieval always exists below some critical temperature, for every pattern distribution ranging from the Boolean to the Gaussian case.


Chatbots: Theory and Practice – Intuition Machine – Medium

#artificialintelligence

There's a lot of fluff surrounding chatbots, so I wrote this post to lay out the basics. I first review the theory of conversation to give us a sense of what we are aiming for. I then discuss three classes of chatbots. The simplest class is purposeless mimicry agents, which only provide the illusion of conversation. Members of this class include ELIZA and chatbots based on deep learning sequence-to-sequence models. The second and next most sophisticated class comprises intention-based agents such as Amazon's Alexa and Apple's Siri. These agents have a simple understanding and can do real stuff, but they generally can't have multi-turn conversations. The third and most sophisticated class is conversational agents that can keep track of what has been said in the conversation and can switch topics when the human user desires. Conversation begins with shared reference.


Dynamic Clustering Algorithms via Small-Variance Analysis of Markov Chain Mixture Models

arXiv.org Machine Learning

Bayesian nonparametrics are a class of probabilistic models in which the model size is inferred from data. A recently developed methodology in this field is small-variance asymptotic analysis, a mathematical technique for deriving learning algorithms that capture much of the flexibility of Bayesian nonparametric inference algorithms, but are simpler to implement and less computationally expensive. Past work on small-variance analysis of Bayesian nonparametric inference algorithms has exclusively considered batch models trained on a single, static dataset, which are incapable of capturing time evolution in the latent structure of the data. This work presents a small-variance analysis of the maximum a posteriori filtering problem for a temporally varying mixture model with a Markov dependence structure, which captures temporally evolving clusters within a dataset. Two clustering algorithms result from the analysis: D-Means, an iterative clustering algorithm for linearly separable, spherical clusters; and SD-Means, a spectral clustering algorithm derived from a kernelized, relaxed version of the clustering problem. Empirical results from experiments demonstrate the advantages of using D-Means and SD-Means over contemporary clustering algorithms, in terms of both computational cost and clustering accuracy.


Probabilistic Graphical Models for Credibility Analysis in Evolving Online Communities

arXiv.org Machine Learning

One of the major hurdles preventing the full exploitation of information from online communities is the widespread concern regarding the quality and credibility of user-contributed content. Prior works in this domain operate on a static snapshot of the community, making strong assumptions about the structure of the data (e.g., relational tables), or consider only shallow features for text classification. To address the above limitations, we propose probabilistic graphical models that can leverage the joint interplay between multiple factors in online communities --- like user interactions, community dynamics, and textual content --- to automatically assess the credibility of user-contributed online content, and the expertise of users and their evolution, with user-interpretable explanations. To this end, we devise new models based on Conditional Random Fields for different settings like incorporating partial expert knowledge for semi-supervised learning, and handling discrete labels as well as numeric ratings for fine-grained analysis. This enables applications such as extracting reliable side-effects of drugs from user-contributed posts in health forums, and identifying credible content in news communities. Online communities are dynamic, as users join and leave, adapt to evolving trends, and mature over time. To capture these dynamics, we propose generative models based on Hidden Markov Models, Latent Dirichlet Allocation, and Brownian Motion to trace the continuous evolution of user expertise and their language model over time. This allows us to identify expert users and credible content jointly over time, improving state-of-the-art recommender systems by explicitly considering the maturity of users. This also enables applications such as identifying helpful product reviews, and detecting fake and anomalous reviews with limited information.


Parameter identification in Markov chain choice models

arXiv.org Machine Learning

In assortment planning, the seller's goal is to select a subset of products (called an assortment) to offer to a customer so as to maximize the expected revenue. This task can be formulated as an optimization problem given the revenue generated from selling each product, along with a probabilistic model of the customer's preferences for the products. Such a discrete choice model must capture the customer's substitution behavior when, for instance, the offered assortment does not contain the customer's most preferred product. Our focus in this paper is the Markov chain choice model (MCCM) proposed by Blanchet et al. (2016). In this model, the product selected by the customer is determined by a Markov chain over products where the products in the offered assortment are absorbing states. The current state represents the desired product; if that product is not offered, the customer transitions to another product according to the Markov chain probabilities, and the process continues until the desired product is offered or the customer leaves. MCCM generalizes widely-used discrete choice models such as the multinomial logit model (Luce, 1959; Plackett, 1975), as well as other generalized attraction models (Gallego et al., 2014); it also well-approximates other random utility models found in the literature such as mixed multinomial logit models (McFadden and Train, 2000). At the same time, the MCCM permits computationally efficient unconstrained assortment optimization as well as efficient approximation algorithms in the constrained case (Blanchet et al., 2016; Désir et al., 2015); this stands in contrast to some richer models such as mixed multinomial logit models (Rusmevichientong et al., 2010) and the nested logit model (Davis et al., 2014) for which assortment optimization is generally intractable. This combination of expressiveness and computational tractability makes MCCM very attractive for use in assortment planning.
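The absorbing-chain description above can be made concrete with a small sketch that computes choice probabilities for a given assortment. It assumes that state 0 is the no-purchase option and is always absorbing, that `lam[i]` is the probability that product `i` is the initially desired product, and that `rho[j][i]` is the transition probability from `j` to `i`. The function name and the numbers are hypothetical, and the propagation loop is one simple way to obtain absorption probabilities, not an algorithm taken from the paper.

```python
def choice_probabilities(lam, rho, assortment, tol=1e-12, max_iter=10_000):
    """Absorption probabilities of the Markov chain for an offered assortment."""
    n = len(lam)
    absorbing = set(assortment) | {0}   # offered products and no-purchase
    absorbed = [0.0] * n
    mass = list(lam)                    # probability mass still moving
    for _ in range(max_iter):
        # Mass sitting on an absorbing state is chosen and stops moving.
        for i in absorbing:
            absorbed[i] += mass[i]
            mass[i] = 0.0
        if sum(mass) < tol:
            break
        # Remaining mass sits on non-offered products: take one transition.
        new_mass = [0.0] * n
        for j in range(n):
            if mass[j] > 0.0:
                for i in range(n):
                    new_mass[i] += mass[j] * rho[j][i]
        mass = new_mass
    return {i: absorbed[i] for i in sorted(absorbing)}


# Hypothetical instance with three products (1, 2, 3) and no-purchase state 0.
lam = [0.1, 0.3, 0.3, 0.3]
rho = [
    [1.0, 0.0, 0.0, 0.0],   # row for state 0 is never used: 0 is absorbing
    [0.2, 0.0, 0.5, 0.3],
    [0.2, 0.4, 0.0, 0.4],
    [0.2, 0.4, 0.4, 0.0],
]

probs = choice_probabilities(lam, rho, assortment={1, 2})
```

With products 1 and 2 offered, a customer who initially wants product 3 substitutes to product 1 or 2 with probability 0.4 each and leaves with probability 0.2, so the demand for product 3 is redistributed across the offered products and the no-purchase option.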


An Infinite Hidden Markov Model With Similarity-Biased Transitions

arXiv.org Machine Learning

We describe a generalization of the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) which is able to encode prior information that state transitions are more likely between "nearby" states. This is accomplished by defining a similarity function on the state space and scaling transition probabilities by pair-wise similarities, thereby inducing correlations among the transition distributions. We present an augmented data representation of the model as a Markov Jump Process in which: (1) some jump attempts fail, and (2) the probability of success is proportional to the similarity between the source and destination states. This augmentation restores conditional conjugacy and admits a simple Gibbs sampler. We evaluate the model and inference method on a speaker diarization task and a "harmonic parsing" task using four-part chorale data, as well as on several synthetic datasets, achieving favorable comparisons to existing models.
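The effect of scaling transition probabilities by pairwise similarities can be illustrated on a finite state space. The sketch below assumes each state has a scalar location and uses an exponential similarity $\phi(j, k) = \exp(-|x_j - x_k| / \ell)$; the similarity function, the length scale, and the explicit renormalisation are illustrative assumptions, not the paper's specification.

```python
import math


def similarity_biased_transitions(base, locations, length_scale=1.0):
    """Scale each base transition probability by a similarity, then renormalise."""
    n = len(base)
    scaled = []
    for j in range(n):
        weights = [
            base[j][k]
            * math.exp(-abs(locations[j] - locations[k]) / length_scale)
            for k in range(n)
        ]
        total = sum(weights)
        scaled.append([w / total for w in weights])
    return scaled


# Three states: 0 and 1 are close together, 2 is far from both.
locations = [0.0, 1.0, 5.0]
uniform = [[1 / 3] * 3 for _ in range(3)]

biased = similarity_biased_transitions(uniform, locations)
```

Starting from a uniform base matrix, the biased matrix gives a transition from state 0 to the nearby state 1 a higher probability than a transition to the distant state 2, which is the kind of prior information the model is meant to encode.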


Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains

arXiv.org Machine Learning

We consider the minimization of an objective function given access to unbiased estimates of the function gradients. This key methodological problem has raised interest in different communities: in large-scale machine learning (Bottou and Bousquet, 2008; Shalev-Shwartz et al., 2009, 2007), optimization (Nemirovski et al., 2009; Nesterov and Vial, 2008), and stochastic approximation (Kushner and Yin, 2003; Polyak and Juditsky, 1992; Ruppert, 1988). The most widely used algorithms are stochastic gradient descent (SGD), a.k.a. the Robbins-Monro algorithm (Robbins and Monro, 1951), and some of its modifications based on averaging of the iterates (Polyak and Juditsky, 1992; Rakhlin et al., 2011; Shamir and Zhang, 2013). While the choice of the step-size may be done robustly in the deterministic case (see, e.g., Bertsekas, 1995), this remains a traditional theoretical and practical issue in the stochastic case. Indeed, early work suggested using step-sizes decaying with the number $k$ of iterations as $O(1/k)$ (Robbins and Monro, 1951), but this appeared to be non-robust to ill-conditioning, and slower decays such as $O(1/\sqrt{k})$ together with averaging lead to both good practical and theoretical performance (Bach, 2014). We consider in this paper constant step-size SGD, which is often used in practice. Although the algorithm does not converge in general to the global optimum of the objective function, constant step-sizes come with benefits: (a) there is a single parameter value to set as opposed to the several choices of parameters to deal with decaying step-sizes, e.g., as 1/( k)