 Learning Graphical Models


Introduction to Naive Bayes

#artificialintelligence

I think there's a rule somewhere that says "You can't call yourself a data scientist until you've used a Naive Bayes classifier". This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion. Let's start with a problem to motivate our formulation of Naive Bayes. Suppose we own a professional networking site similar to LinkedIn. Users sign up, type some information about themselves, and then roam the network looking for jobs/connections/etc. Until recently, we only required users to enter their current job title, but now we're asking them what industry they work in.


On the Prior Sensitivity of Thompson Sampling

arXiv.org Machine Learning

The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge.
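
To make the role of the prior concrete, here is a minimal Python sketch of Thompson Sampling over a finite set of candidate reward models, where the prior is simply a probability vector over those models. The Bernoulli rewards, the two-model example, and every name in the code (thompson_sampling, models, true_model) are illustrative assumptions, not the paper's construction or notation.

```python
import random


def thompson_sampling(models, prior, true_model, horizon, seed=0):
    """Thompson Sampling with a discrete prior over candidate bandit models.

    Assumed setup (hypothetical, for illustration only):
    models[m][a] is the Bernoulli mean of arm a under candidate model m,
    prior[m] is the prior probability of model m, and true_model is the
    index of the model that actually generates the rewards.
    Returns (cumulative regret, final posterior).
    """
    rng = random.Random(seed)
    posterior = list(prior)
    true_means = models[true_model]
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # 1. Sample one candidate model from the current posterior.
        m = rng.choices(range(len(models)), weights=posterior)[0]
        # 2. Pull the arm that is best under the sampled model.
        arm = max(range(len(true_means)), key=lambda a: models[m][a])
        # 3. Observe a Bernoulli reward from the true model.
        reward = 1 if rng.random() < true_means[arm] else 0
        regret += best_mean - true_means[arm]
        # 4. Bayes update: reweight each model by the reward's likelihood.
        likelihoods = [mu[arm] if reward else 1.0 - mu[arm] for mu in models]
        unnormalized = [w * l for w, l in zip(posterior, likelihoods)]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
    return regret, posterior


# Two candidate models that disagree about which arm is better.
models = [(0.8, 0.2), (0.2, 0.8)]
for p in (0.9, 0.1):  # prior mass placed on the true model (index 0)
    regret, _ = thompson_sampling(models, [p, 1 - p], true_model=0, horizon=500)
    print(f"p={p}: cumulative regret over 500 rounds = {regret:.1f}")
```

Running the loop with a large and a small p gives a rough feel for how a good or bad prior changes the regret; a single seeded run is only an illustration and says nothing about the worst-case bounds stated in the abstract.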



An Introduction to Language Modeling With N-Grams and Markov Chains

#artificialintelligence

See the full presentation, slides, and notes. N = 1: "Unigram (or, you know, a word)", i.e. "The". "A Markov chain is a probabilistic model well suited to semi-coherent text synthesis."

While straining to live in Ukraine with anxiety and broad range of my surroundings, along the ones I felt physically threatened and the rush I burst into a ten-year old who they sought a poem that matters, I was I should be invincible. Who would have paid for granted, but maybe it was asked to further education is an annual overnight to San Diego, water fun, cheers, a year, I still burn in the invisible enemy in the night when I cannot feel the traffic outside the times I want to a missionary would be neither relived nor reanimated. I assume the status quo, seems fair; I were a stylish figure, for me, and knees.
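
A word-level Markov chain of the kind described here can be sketched in a few lines of Python. This is a minimal illustration, not the presentation's own code: the function names (build_chain, generate), the whitespace tokenization, and the restart-on-dead-end rule are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each n-gram of `order` words to the list of words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length, seed=None):
    """Random-walk the chain to produce `length` words of synthetic text.

    Requires a non-empty chain.
    """
    rng = random.Random(seed)
    states = sorted(chain)           # sorted so a fixed seed is reproducible
    state = rng.choice(states)
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (state only seen at the end of the corpus): restart.
            state = rng.choice(states)
            output.extend(state)
            continue
        next_word = rng.choice(followers)
        output.append(next_word)
        # Slide the window: drop the oldest word, append the new one.
        state = state[1:] + (next_word,)
    return " ".join(output[:length])


corpus = "the cat sat on the mat and the dog sat on the cat"
chain = build_chain(corpus, order=1)
print(generate(chain, length=12, seed=3))
```

With order=1 each state is a single word (a unigram), so the output is locally plausible but drifts quickly, much like the sample text above. Raising order makes the output more coherent at the cost of copying longer runs from the corpus.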


Multi-category Angle-based Classifier Refit

arXiv.org Machine Learning

Classification is an important statistical learning tool. In real applications, besides high prediction accuracy, it is often desirable to estimate class conditional probabilities for new observations. For traditional problems where the number of observations is large, there exist many well-developed approaches. Recently, high-dimensional, low-sample-size problems have become increasingly common. Margin-based classifiers, such as logistic regression, are well-established methods in the literature. However, in terms of probability estimation, it is known that for binary classifiers, the commonly used methods tend to under-estimate the norm of the classification function. This can lead to biased probability estimation. Remedies have been proposed in the literature. For the simultaneous multicategory classification framework, however, much less work has been done. We fill the gap in this paper. In particular, we give theoretical insights on why heavy regularization terms are often needed in high-dimensional applications, and how this can lead to bias in probability estimation. To overcome this difficulty, we propose a new refit strategy for multicategory angle-based classifiers. Our new method adds only a small computational cost to the problem, and is able to attain prediction accuracy that is as good as the regular margin-based classifiers. Moreover, the improvement in probability estimation can be very significant. Numerical results suggest that the new refit approach is highly competitive.


Computing Your Skill

#artificialintelligence

Summary: I describe how the TrueSkill algorithm works using concepts you're already familiar with. TrueSkill is used on Xbox Live to rank and match players and it serves as a great way to understand how statistical machine learning is actually applied today. I've also created an open source project where I implemented TrueSkill three different times in increasing complexity and capability. In addition, I've created a detailed supplemental math paper that works out equations that I gloss over here. Feel free to jump to sections that look interesting and ignore ones that seem boring. Don't worry if this post seems a bit long; there are lots of pictures. It seemed easy enough: I wanted to create a database to track the skill levels of my coworkers in chess and foosball. I already knew that I wasn't very good at foosball and would bring down better players. I was curious if an algorithm could do a better job at creating well-balanced matches. I also wanted to see if I was improving at chess. I knew I needed to have an easy way to collect results from everyone and then use an algorithm that would keep getting better with more data. I was looking for a way to compress all that data and distill it down to some simple knowledge of how skilled people are. Based on some previous things that I had heard about, this seemed like a good fit for "machine learning." Machine learning is a hot area in Computer Science, but it's intimidating. Like most subjects, there's a lot to learn to be an expert in the field. I didn't need to go very deep; I just needed to understand enough to solve my problem. I found a link to the paper describing the TrueSkill algorithm and I read it several times, but it didn't make sense. It was only 8 pages long, but it seemed beyond my capability to understand.


A Beginner's Tutorial for Restricted Boltzmann Machines - Deeplearning4j: Open-source, distributed deep learning for the JVM

#artificialintelligence

Invented by Geoff Hinton, a restricted Boltzmann machine is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modeling. Given their relative simplicity and historical importance, restricted Boltzmann machines are the first neural network we'll tackle. In the paragraphs below, we describe in diagrams and plain language how they work. RBMs are shallow, two-layer neural nets that constitute the building blocks of deep-belief networks. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer. Each circle in the graph above represents a neuron-like unit called a node, and nodes are simply where calculations take place.
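
The two-layer structure can be made concrete with a small Python sketch of the calculation each node performs. It assumes binary units with a logistic (sigmoid) activation, which the excerpt does not spell out. The function names and the toy weights are illustrative only and are not taken from Deeplearning4j.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """Activation probability of each hidden node given the visible layer.

    weights[i][j] is the weight between visible node i and hidden node j.
    """
    return [
        sigmoid(hidden_bias[j] + sum(v * weights[i][j] for i, v in enumerate(visible)))
        for j in range(len(hidden_bias))
    ]


def visible_probabilities(hidden, weights, visible_bias):
    """Reconstruction pass: the same weights, used in the other direction."""
    return [
        sigmoid(visible_bias[i] + sum(h * weights[i][j] for j, h in enumerate(hidden)))
        for i in range(len(visible_bias))
    ]


def sample_binary(probabilities, rng):
    """Turn activation probabilities into stochastic 0/1 node states."""
    return [1 if rng.random() < p else 0 for p in probabilities]


# Toy RBM with 3 visible and 2 hidden nodes (weights chosen arbitrarily).
weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
rng = random.Random(0)
visible = [1, 0, 1]
hidden = sample_binary(hidden_probabilities(visible, weights, [0.0, 0.0]), rng)
print("hidden states:", hidden)
print("reconstruction:", visible_probabilities(hidden, weights, [0.0, 0.0, 0.0]))
```

There are no connections within a layer, so each hidden probability depends only on the visible nodes, and each reconstructed visible probability depends only on the hidden nodes. Training, which this sketch omits, would adjust the weights and biases.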


Graphical Model Sketch

arXiv.org Machine Learning

Structured high-cardinality data arises in many domains, and poses a major challenge for both modeling and inference. Graphical models are a popular approach to modeling structured data but they are unsuitable for high-cardinality variables. The count-min (CM) sketch is a popular approach to estimating probabilities in high-cardinality data but it does not scale well beyond a few variables. In this work, we bring together the ideas of graphical models and count sketches, and propose and analyze several approaches to estimating probabilities in structured high-cardinality streams of data. The key idea of our approximations is to use the structure of a graphical model and approximately estimate its factors by "sketches", which hash high-cardinality variables using random projections. Our approximations are computationally efficient and their space complexity is independent of the cardinality of variables. Our error bounds are multiplicative and significantly improve upon those of the CM sketch, a state-of-the-art approach to estimating probabilities in streams. We evaluate our approximations on synthetic and real-world problems, and report order-of-magnitude improvements over the CM sketch.
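
For readers unfamiliar with the baseline, the following is a minimal Python sketch of a plain count-min sketch used to estimate probabilities in a stream. It illustrates only the standard CM sketch, not the graphical-model approximations the paper proposes. The class name, the default width and depth, and the SHA-256-based row hashes are assumptions chosen for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; memory does not grow with cardinality."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.total = 0

    def _index(self, row, item):
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item!r}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count
        self.total += count

    def estimate_count(self, item):
        # Collisions only inflate counters, so the minimum over rows is the
        # tightest available estimate and never undercounts.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def estimate_probability(self, item):
        return self.estimate_count(item) / self.total if self.total else 0.0


sketch = CountMinSketch()
for word in "a b a c a b d a".split():
    sketch.add(word)
print("estimated P(a):", sketch.estimate_probability("a"))
```

To sketch a joint distribution over several variables this way, one would hash tuples of values. The number of distinct tuples grows quickly with the number of variables, which fits the abstract's remark that the CM sketch does not scale well beyond a few variables.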


GP-select: Accelerating EM using adaptive subspace preselection

arXiv.org Machine Learning

We propose a nonparametric procedure to achieve fast inference in generative graphical models when the number of latent states is very large. The approach is based on iterative latent variable preselection, where we alternate between learning a 'selection function' to reveal the relevant latent variables and using this function to obtain a compact approximation of the posterior distribution for EM; this can make inference possible where the number of possible latent states is, e.g., exponential in the number of latent variables, whereas an exact approach would be computationally infeasible. We learn the selection function entirely from the observed data and current EM state via Gaussian process regression. This contrasts with earlier approaches, where selection functions were manually designed for each problem setting. We show that our approach performs as well as these bespoke selection functions on a wide variety of inference problems: in particular, for the challenging case of a hierarchical model for object localization with occlusion, we achieve results that match a customized state-of-the-art selection method, at a far lower computational cost.


Maximum Entropy Learning with Deep Belief Networks

#artificialintelligence

Understanding how a nervous system computes requires determining the input, the output, and the transformations necessary to convert the input into the desired output [1]. Artificial neural networks are a conceptual framework that provides insight into how these transformations are carried out, and have also played a crucial role in the success of many pattern recognition tasks such as handwriting [2] and object [3] detection. An important feature of neural networks is their ability to capture the underlying regularities in a task domain by representing the input with multiple layers of active neurons. This distributed representation of the input is based on the hierarchical processing and information flow of biological systems [4,5]. In a multi-layered network, complex internal representations can also be constructed by repeatedly adjusting the weights of the connections in order to ensure that the output is close to the desired output [6].