Learning Graphical Models
An optimal learning method for developing personalized treatment regimes
A treatment regime is a function that maps individual patient information to a recommended treatment, hence explicitly incorporating the heterogeneity in need for treatment across individuals. Patient responses are dichotomous and can be predicted through an unknown relationship that depends on the patient information and the selected treatment. The goal is to find the treatments that lead to the best patient responses on average. Each experiment is expensive, forcing us to learn the most from each experiment. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy to guide the treatment assignment by maximizing the expected value of information, for which an approximation method is used to overcome computational challenges. We provide a detailed study on how to make sequential medical decisions under uncertainty to reduce health care costs on a real world knee replacement dataset. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.
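The idea of assigning treatments by expected value of information can be illustrated in a much simpler setting than the one in the abstract. The sketch below assumes independent Beta beliefs over the success probability of each treatment and no patient context, which is not the contextual model of the paper; the function name `knowledge_gradient` and the example beliefs are hypothetical.

```python
def knowledge_gradient(beliefs):
    """One-step knowledge gradient for independent Beta-Bernoulli beliefs.

    beliefs: list of (a, b) Beta parameters, one pair per treatment
    (at least two treatments are required).
    Returns, per treatment, the expected increase in the best posterior
    mean after observing one more binary outcome for that treatment.
    """
    means = [a / (a + b) for a, b in beliefs]
    best_now = max(means)
    values = []
    for i, (a, b) in enumerate(beliefs):
        # Best posterior mean among the treatments we do not sample.
        best_other = max(m for j, m in enumerate(means) if j != i)
        mean_if_success = (a + 1) / (a + b + 1)
        mean_if_failure = a / (a + b + 1)
        expected_best = (
            means[i] * max(mean_if_success, best_other)
            + (1 - means[i]) * max(mean_if_failure, best_other)
        )
        values.append(expected_best - best_now)
    return values


# Three treatments with the same mean (0.5) but different amounts of evidence.
beliefs = [(1, 1), (5, 5), (50, 50)]
kg = knowledge_gradient(beliefs)
# Pure-exploration choice: sample the treatment with the largest value.
chosen = max(range(len(kg)), key=lambda i: kg[i])
```

With equal means, the least-explored treatment has the largest value of information, so it is the one selected.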
One-Shot Session Recommendation Systems with Combinatorial Items
David, Yahel, Di Castro, Dotan, Karnin, Zohar
In recent years, content recommendation systems in large websites (or content providers) have attracted increased attention. While the type of content varies (e.g., movies, articles, music, advertisements), the high-level problem remains the same: based on the knowledge obtained so far about the user, recommend the most desired content. In this paper we present a method to handle the well-known user cold-start problem in recommendation systems. In this scenario, a recommendation system encounters a new user and the objective is to present items that are as relevant as possible, in the hope of keeping the user's session as long as possible. We formulate an optimization problem that aims to maximize the length of this initial session, as this is believed to be the key to having the user come back and perhaps register with the system. In particular, our model captures the fact that a single round with low-quality recommendations is likely to terminate the session. In such a case, we do not proceed to the next round, as the user leaves the system, possibly never to be seen again. We call this phenomenon a One-Shot Session. Our optimization problem is formulated as an MDP whose action space is combinatorial, since we recommend multiple items in each round. This huge action space presents a computational challenge that makes the straightforward solution intractable. We analyze the structure of the MDP to prove monotone and submodular-like properties that allow a computationally efficient solution via a method we call Greedy Value Iteration (G-VI).
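Why monotone and submodular structure helps with a combinatorial action space can be seen in a toy version of a single round. The sketch below is not G-VI; it only shows a greedy selection step under the assumption that the session continues if the user likes at least one recommended item and that items are liked independently. The function names and the probabilities are hypothetical.

```python
def continue_probability(selected, like_prob):
    """Probability that at least one selected item is liked (independent items)."""
    miss = 1.0
    for item in selected:
        miss *= 1.0 - like_prob[item]
    return 1.0 - miss


def greedy_select(like_prob, k):
    """Pick k items one at a time, each time adding the largest marginal gain."""
    selected = []
    remaining = set(like_prob)
    for _ in range(min(k, len(like_prob))):
        current = continue_probability(selected, like_prob)
        best_item = max(
            sorted(remaining),  # sorted for deterministic tie-breaking
            key=lambda item: continue_probability(selected + [item], like_prob) - current,
        )
        selected.append(best_item)
        remaining.remove(best_item)
    return selected


like_prob = {"a": 0.5, "b": 0.2, "c": 0.4}
slate = greedy_select(like_prob, 2)
p_continue = continue_probability(slate, like_prob)
```

The greedy loop evaluates on the order of k times n candidate slates instead of every size-k subset, which is the kind of saving that such properties make possible.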
A Beginner's Tutorial for Restricted Boltzmann Machines - Deeplearning4j: Open-source, distributed deep learning for the JVM
Invented by Geoff Hinton, a Restricted Boltzmann machine is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modeling. Given their relative simplicity and historical importance, restricted Boltzmann machines are the first neural network we'll tackle. In the paragraphs below, we describe in diagrams and plain language how they work. RBMs are shallow, two-layer neural nets that constitute the building blocks of deep-belief networks. The first layer of the RBM is called the visible, or input, layer, and the second is the hidden layer. Each circle in the graph above represents a neuron-like unit called a node, and nodes are simply where calculations take place.
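The calculation that takes place at the hidden nodes can be sketched in a few lines. This is a minimal illustration in plain Python, assuming binary visible units and the usual logistic activation; it does not use the Deeplearning4j API, and the function name and weights are made up for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """Activation probability of each hidden node given the visible layer.

    weights[i][j] connects visible node i to hidden node j.
    """
    probs = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        probs.append(sigmoid(activation))
    return probs


visible = [1, 0, 1]
weights = [[1.0, -1.0], [0.5, 0.5], [1.0, -1.0]]
probs = hidden_probabilities(visible, weights, hidden_bias=[0.0, 0.0])
```

Each hidden node sums its weighted inputs, adds its bias and squashes the result into a probability of turning on.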
How to Build a Neuron: Exploring AI in JavaScript Pt 2 -- JavaScript Scene
In this series, we're discussing a topic that will transform the world we live in over the course of the next 25 years. We're going to see lots of drones, self driving cars, VR, and AR devices changing how we get around, how we transport things, and how we see and interact with the world, and it will all be powered by AI and neural nets. In part 1, we talked a little bit about what neurons are and how they work, and wrapped it up by showing a trivial example of how to sum synapse inputs and determine whether or not the neuron should fire, and finished off the article by suggesting a question: What about time? From here on out I'll be recording these adventures in a library called neurolib. If you're at all familiar with traditional neural nets, you're probably wondering when I'm going to start talking about gradient descent or Hidden Markov Models (HMM).
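The trivial example mentioned from part 1, summing synapse inputs and deciding whether the neuron fires, might look roughly like the following. This is a sketch written in Python rather than the series' JavaScript, it is not taken from neurolib, and the weights and threshold are arbitrary illustrative values.

```python
def should_fire(inputs, weights, threshold):
    """Return True when the weighted sum of synapse inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total >= threshold


weights = [0.5, 0.4, 0.9]
fires_with_two_inputs = should_fire([1, 1, 0], weights, threshold=0.8)
fires_with_one_input = should_fire([1, 0, 0], weights, threshold=0.8)
```

Note that this version has no notion of time: every call looks only at the inputs it is given.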
Bayesian machine learning - FastML
So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - we know it took us a while. This article is an introduction we wish we had back then. While we have some grasp on the matter, we're not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.
Missing Data Estimation in High-Dimensional Datasets: A Swarm Intelligence-Deep Neural Network Approach
Leke, Collins, Marwala, Tshilidzi
In this paper, we examine the problem of missing data in high-dimensional datasets by taking into consideration the Missing Completely at Random and Missing at Random mechanisms, as well as the Arbitrary missing pattern. Additionally, this paper employs a methodology based on Deep Learning and Swarm Intelligence algorithms in order to provide reliable estimates for missing data. The deep learning technique is used to extract features from the input data via an unsupervised learning approach by modeling the data distribution based on the input. This deep learning technique is then used as part of the objective function for the swarm intelligence technique in order to estimate the missing data after a supervised fine-tuning phase by minimizing an error function based on the interrelationship and correlation between features in the dataset. The methodology investigated in this paper therefore has longer running times; however, the promising potential outcomes justify the tradeoff. Basic knowledge of statistics is presumed.
A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation
Liu, Qiang, Lee, Jason D., Jordan, Michael I.
We derive a new discrepancy statistic for measuring differences between two probability distributions based on combining Stein's identity with the reproducing kernel Hilbert space theory. We apply our result to test how well a probabilistic model fits a set of observations, and derive a new class of powerful goodness-of-fit tests that are widely applicable for complex and high dimensional distributions, even for those with computationally intractable normalization constants. Both theoretical and empirical properties of our methods are studied thoroughly.
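To make the combination of Stein's identity with a kernel concrete, here is a one-dimensional sketch of a U-statistic estimate of the squared discrepancy. It assumes an RBF kernel with a fixed bandwidth and a standard normal model, whose score function is -x; the function names, bandwidth, sample size and seed are illustrative choices rather than anything prescribed by the abstract. Only the score of the model is needed, so the normalization constant never appears.

```python
import math
import random


def stein_kernel(x, y, score, h=1.0):
    """Stein-modified RBF kernel u(x, y) for one-dimensional inputs."""
    d = x - y
    k = math.exp(-d * d / (2.0 * h * h))
    dk_dx = -d / (h * h) * k
    dk_dy = d / (h * h) * k
    d2k_dxdy = (1.0 / (h * h) - d * d / h ** 4) * k
    return score(x) * score(y) * k + score(x) * dk_dy + score(y) * dk_dx + d2k_dxdy


def ksd_u_statistic(samples, score, h=1.0):
    """Average of u(x_i, x_j) over all ordered pairs with i != j."""
    n = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += stein_kernel(samples[i], samples[j], score, h)
    return total / (n * (n - 1))


def standard_normal_score(x):
    # Derivative of the log density of N(0, 1).
    return -x


random.seed(0)
matching = [random.gauss(0.0, 1.0) for _ in range(300)]
shifted = [random.gauss(2.0, 1.0) for _ in range(300)]
ksd_matching = ksd_u_statistic(matching, standard_normal_score)
ksd_shifted = ksd_u_statistic(shifted, standard_normal_score)
```

Samples drawn from the model itself should give an estimate near zero, while the shifted samples should give a clearly larger value.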
Unsupervised Learning with Imbalanced Data via Structure Consolidation Latent Variable Model
Yousefi, Fariba, Dai, Zhenwen, Ek, Carl Henrik, Lawrence, Neil
Unsupervised learning on imbalanced data is challenging because current models are often dominated by the major category and ignore the categories with small amounts of data. We develop a latent variable model that can cope with imbalanced data by dividing the latent space into a shared space and a private space. Based on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables the separation of the latent space, and we derive an efficient variational inference method. The performance of our model is demonstrated with an imbalanced medical image dataset.
Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing
Dhaka, Akash Kumar, Salvi, Giampiero
A key factor that determines the usability of applications based on speech recognition is the latency or lag of the system. In dialogue systems, for example, long latencies may disrupt the natural turn-taking in the human-machine conversation. In other applications the lag may be even more critical. A typical example involves systems that use ASR to drive the lip movements of an avatar in real time to support telepresence [3, 4, 5]. The latency in a typical speech recogniser based on a hybrid between Neural Networks (NNs) and Hidden Markov Models (HMMs) is determined by a number of factors:
- the hardware (sound card) introduces some lag in digitising the speech samples and making them available to the drivers. Typical values are in the order of milliseconds;
- the speech samples are returned by the driver in buffers of a certain size (this could be as long as half a second, but can be reduced to a few ms);
- in spectral based feature extraction, speech samples are grouped into windows (frames), often around 25-40 ms in length;
- many methods for feature extraction also compute time derivatives of the features, which require a number of frames in the past and the future.
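The factors listed above can be combined into a rough latency budget. The sketch below simply adds the contributions, which is an assumed simplification and not a model taken from the paper; the frame shift parameter and all example values are hypothetical.

```python
def estimate_latency_ms(hardware_ms, buffer_ms, window_ms, future_frames, frame_shift_ms):
    """Rough additive estimate of front-end latency in milliseconds.

    future_frames is the number of look-ahead frames needed, for example
    to compute time derivatives; frame_shift_ms is the assumed step
    between consecutive frames.
    """
    lookahead_ms = future_frames * frame_shift_ms
    return hardware_ms + buffer_ms + window_ms + lookahead_ms


# Illustrative numbers only: 2 ms hardware lag, 10 ms driver buffer,
# 25 ms analysis window, 5 future frames at a 10 ms frame shift.
total_ms = estimate_latency_ms(2, 10, 25, 5, 10)
```

In this illustrative budget the look-ahead frames contribute more than any other term, so the number of future frames is the first quantity to examine when latency matters.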
A Single-Pass Classifier for Categorical Data
This paper describes a new method for classifying a dataset that partitions elements into their categories. It has relations with neural networks but a slightly different structure, requiring only a single pass through the classifier to generate the weight sets. A grid-like structure is required as part of a novel idea of converting a 1-D row of real values into a 2-D structure of value bands. Each cell in any band then stores a distinct set of weights, to represent its own importance and its relation to each output category. During classification, all of the output weight lists can be retrieved and summed to produce a probability for what the correct output category is. The bands possibly work like hidden layers of neurons, but they are variable specific, making the process orthogonal. The construction process can be a single update process without iterations, making it potentially much faster. It can also be compared with k-NN and may be practical for partial or competitive updating.
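One loose reading of the band idea can be sketched as follows. It assumes that each variable has a known value range split into equal-width bands, and that a cell's weight for a category is simply a count incremented once per training row; the paper's actual weight update may differ, and all names here are hypothetical.

```python
from collections import defaultdict


def band_index(value, low, high, n_bands):
    """Map a real value to the index of its band within [low, high]."""
    if high <= low:
        return 0
    fraction = (value - low) / (high - low)
    return min(n_bands - 1, max(0, int(fraction * n_bands)))


def train(rows, labels, ranges, n_bands=4):
    """Single pass over the data: one weight update per variable per row."""
    weights = [[defaultdict(float) for _ in range(n_bands)] for _ in ranges]
    for row, label in zip(rows, labels):
        for f, value in enumerate(row):
            low, high = ranges[f]
            weights[f][band_index(value, low, high, n_bands)][label] += 1.0
    return weights


def classify(weights, row, ranges):
    """Sum the weight lists of the matching cells and normalise them."""
    n_bands = len(weights[0])
    totals = defaultdict(float)
    for f, value in enumerate(row):
        low, high = ranges[f]
        cell = weights[f][band_index(value, low, high, n_bands)]
        for label, weight in cell.items():
            totals[label] += weight
    norm = sum(totals.values())
    return {label: weight / norm for label, weight in totals.items()} if norm else {}


rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = ["low", "low", "high", "high"]
ranges = [(0.0, 1.0), (0.0, 1.0)]
model = train(rows, labels, ranges)
scores = classify(model, [0.15, 0.15], ranges)
```

Each variable is handled independently of the others, and training touches each row exactly once.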