 Learning Graphical Models


What Makes You Tick? Using Machine Learning to Study Social Media Engagement

#artificialintelligence

The first paper I wrote for my PhD just got published! I started my PhD with the goal of critically examining the process and outcomes of social media science communication. Despite the flurry of activity in this domain and the huge amount of resources poured into digital public engagement, nobody (and I mean nobody) has ever paused to ask: are we making any real change? Is the public more engaged with science and more scientifically literate than, say, 10 years ago, when Facebook and Twitter weren't the media giants they are today? Given my engineering background, I decided to use the method I know best to approach the problem.


Data Science Dictionary

@machinelearnbot

The idea of cross-validation is to split the data into N subsets, set one subset aside, estimate the parameters of the model from the remaining N-1 subsets, and use the held-out subset to estimate the error of the model. This process is repeated N times, with each of the N subsets used once as the validation set. The error values obtained in these N steps are then combined to provide the final estimate of the model error. Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks, and classification and regression trees (CART). The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.
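The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`k_fold_indices`, `fit_mean`, `cross_validate`) are hypothetical, the "model" is a stand-in that simply predicts the training mean, and the per-fold errors are combined by averaging, which is one common choice among several.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_mean(train_values):
    """Stand-in model (an assumption for illustration): predict the training mean."""
    return mean(train_values)


def cross_validate(ys, n_folds=5, seed=0):
    """Return the averaged squared error and the list of per-fold errors."""
    folds = k_fold_indices(len(ys), n_folds, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        # Estimate the model from the remaining N-1 subsets.
        train = [ys[i] for i in range(len(ys)) if i not in held]
        prediction = fit_mean(train)
        # Use the subset that was set aside to estimate the error.
        fold_errors.append(mean((ys[i] - prediction) ** 2 for i in held_out))
    # Combine the N error values into a single estimate.
    return mean(fold_errors), fold_errors
```

Calling `cross_validate(values, n_folds=5)` on a list of numbers returns the combined error estimate together with the five per-fold errors, so the spread across folds can be inspected as well.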


Square Hellinger Subadditivity for Bayesian Networks and its Applications to Identity Testing

arXiv.org Machine Learning

We show that the square Hellinger distance between two Bayesian networks on the same directed graph, $G$, is subadditive with respect to the neighborhoods of $G$. Namely, if $P$ and $Q$ are the probability distributions defined by two Bayesian networks on the same DAG, our inequality states that the square Hellinger distance, $H^2(P,Q)$, between $P$ and $Q$ is upper bounded by the sum, $\sum_v H^2(P_{\{v\} \cup \Pi_v}, Q_{\{v\} \cup \Pi_v})$, of the square Hellinger distances between the marginals of $P$ and $Q$ on every node $v$ and its parents $\Pi_v$ in the DAG. Importantly, our bound does not involve the conditionals but the marginals of $P$ and $Q$. We derive a similar inequality for more general Markov Random Fields. As an application of our inequality, we show that distinguishing whether two Bayesian networks $P$ and $Q$ on the same (but potentially unknown) DAG satisfy $P=Q$ vs $d_{\rm TV}(P,Q)>\epsilon$ can be performed from $\tilde{O}(|\Sigma|^{3/4(d+1)} \cdot n/\epsilon^2)$ samples, where $d$ is the maximum in-degree of the DAG and $\Sigma$ the domain of each variable of the Bayesian networks. If $P$ and $Q$ are defined on potentially different and potentially unknown trees, the sample complexity becomes $\tilde{O}(|\Sigma|^{4.5} n/\epsilon^2)$, whose dependence on $n, \epsilon$ is optimal up to logarithmic factors. Lastly, if $P$ and $Q$ are product distributions over $\{0,1\}^n$ and $Q$ is known, the sample complexity becomes $O(\sqrt{n}/\epsilon^2)$, which is optimal up to constant factors.
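The inequality can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: it uses a three-node chain X -> Y -> Z over binary variables with randomly drawn conditional tables, and it assumes the normalization $H^2(P,Q) = 1 - \sum_x \sqrt{P(x)Q(x)}$, which the abstract does not spell out. All function names are hypothetical.

```python
import itertools
import math
import random


def random_dist(rng, k):
    """Draw a random probability vector of length k with strictly positive entries."""
    weights = [rng.random() + 0.05 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def random_chain(rng):
    """Joint table of a random Bayesian network X -> Y -> Z over binary variables."""
    px = random_dist(rng, 2)
    py_given_x = [random_dist(rng, 2) for _ in range(2)]  # indexed [x][y]
    pz_given_y = [random_dist(rng, 2) for _ in range(2)]  # indexed [y][z]
    return {
        (x, y, z): px[x] * py_given_x[x][y] * pz_given_y[y][z]
        for x, y, z in itertools.product(range(2), repeat=3)
    }


def marginal(joint, keep):
    """Marginalize a joint table onto the coordinate positions listed in keep."""
    out = {}
    for outcome, prob in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


def hellinger_sq(p, q):
    """Squared Hellinger distance, assuming H^2 = 1 - sum_x sqrt(p(x) q(x))."""
    return 1.0 - sum(math.sqrt(p[k] * q[k]) for k in p)


def check_subadditivity(seed=0):
    """Return (H^2 of the joints, sum of H^2 over each node with its parents)."""
    rng = random.Random(seed)
    p, q = random_chain(rng), random_chain(rng)
    # Neighborhoods {v} plus parents: {X}, {X, Y}, {Y, Z}.
    neighborhoods = [(0,), (0, 1), (1, 2)]
    lhs = hellinger_sq(p, q)
    rhs = sum(hellinger_sq(marginal(p, nb), marginal(q, nb)) for nb in neighborhoods)
    return lhs, rhs
```

For any seed, the first returned value should not exceed the second, up to floating-point rounding. A check like this illustrates the statement on small instances; it is not a substitute for the proof.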


Advancing Bayesian Optimization: The Mixed-Global-Local (MGL) Kernel and Length-Scale Cool Down

arXiv.org Machine Learning

Bayesian Optimization (BO) has become a core method for solving expensive black-box optimization problems. While much research has focused on the choice of the acquisition function, we focus on online length-scale adaptation and the choice of kernel function. Instead of choosing hyperparameters in view of maximum likelihood on past data, we propose to use the acquisition function to decide on hyperparameter adaptation more robustly and in view of the future optimization progress. Further, we propose a particular kernel function that includes non-stationarity and local anisotropy and thereby implicitly integrates the efficiency of local convex optimization with global Bayesian optimization. Comparisons to state-of-the-art BO methods underline the efficiency of these mechanisms on global optimization benchmarks.


Robust mixture of experts modeling using the $t$ distribution

arXiv.org Machine Learning

Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in data for regression, classification, and clustering. For regression and cluster analyses of continuous data, MoE models usually use normal experts following the Gaussian distribution. However, for a set of data containing a group or groups of observations with heavy tails or atypical observations, the use of normal experts is unsuitable and can unduly affect the fit of the MoE model. We introduce robust MoE modeling using the $t$ distribution. The proposed $t$ MoE (TMoE) deals with these issues regarding heavy-tailed and noisy data. We develop a dedicated expectation-maximization (EM) algorithm to estimate the parameters of the proposed model by monotonically maximizing the observed data log-likelihood. We describe how the presented model can be used in prediction and in model-based clustering of regression data. The proposed model is validated on numerical experiments carried out on simulated data, which show the effectiveness and the robustness of the proposed model in terms of modeling non-linear regression functions as well as in model-based clustering. It is then applied to two real-world data sets: tone perception data for musical data analysis, and temperature anomaly data for the analysis of climate change. The obtained results show the usefulness of the TMoE model for practical applications.


2-D random walks: simulation, video with R source code, curious facts

@machinelearnbot

We have produced a 90-second video (click on this link to view the video) showing a 'random walk' (a particular case of a Markov process) evolving over 400,000 steps. Figure 1 below shows the last frame (out of 2,000 frames, each one with 200 new steps). A basic, two-state (going up or down), one-dimensional Markov process is defined as follows: You start at time t = 0, walking along the X-axis (representing time). At each iteration (also called step), you move up with probability p, and down with probability q, along the Y-axis. The Y-axis could represent gains/losses in a gamble (throwing a die), stock market gains, etc.
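The one-dimensional process described above can be simulated in a few lines. The original article ships R source code; the Python sketch below is an independent illustration, not a port of it. It assumes a unit step size, q = 1 - p, and a starting height of 0; the function name `random_walk` is hypothetical.

```python
import random


def random_walk(n_steps, p=0.5, seed=None):
    """Simulate a 1-D random walk; return the height at times 0..n_steps.

    At each step the walker moves up by 1 with probability p and down by 1
    with probability q = 1 - p (unit steps are an assumption made here).
    """
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(n_steps):
        position += 1 if rng.random() < p else -1
        path.append(position)
    return path
```

For example, `random_walk(400_000, p=0.5, seed=1)` produces a path of the same length as the one in the video; plotting the list index (time) against the stored value (height) gives the familiar jagged trajectory.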


Naïve-Bayes Technique for Machine Learning Blog - BRIDGEi2i Analytics Solutions

#artificialintelligence

"We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." "When you have two competing theories that make exactly the same predictions, the simpler one is the better." One famous example of Occam's Razor in action is found in conspiracy theories surrounding the NASA moon landings. Many conspiracy theorists believe that the first Moon Landing was staged and filmed in a studio, part of an elaborate hoax. Their justification relies upon many twisted and convoluted theories, whereas the NASA argument is fairly straightforward.


Composing Music with Grammar Argumented Neural Networks and Note-Level Encoding

arXiv.org Artificial Intelligence

Creating aesthetically pleasing pieces of art, including music, has been a long-term goal for artificial intelligence research. Despite recent successes of long short-term memory (LSTM) recurrent neural networks (RNNs) in sequential learning, LSTM neural networks have not, by themselves, been able to generate natural-sounding music conforming to music theory. To transcend this inadequacy, we put forward a novel method for music composition that combines the LSTM with grammars motivated by music theory. The main tenets of music theory are encoded as grammar argumented (GA) filters on the training data, such that the machine can be trained to generate music inheriting the naturalness of human-composed pieces from the original dataset while adhering to the rules of music theory. Unlike previous approaches, pitches and durations are encoded as one semantic entity, which we refer to as note-level encoding. This allows easy implementation of music theory grammars, as well as closer emulation of the thinking pattern of a musician. Although the GA rules are applied to the training data and never directly to the LSTM music generation, our machine still composes music that possesses high incidences of diatonic scale notes, small pitch intervals and chords, in deference to music theory.


Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

arXiv.org Machine Learning

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use small storage and have low computational complexity per iteration.


Predicting with confidence: the best machine learning idea you never heard of

#artificialintelligence

One of the disadvantages of machine learning as a discipline is the lack of reasonable confidence intervals on a given prediction. There are all kinds of reasons you might want such a thing, but I think machine learning and data science practitioners are so drunk with newfound powers, they forget where such a thing might be useful. If you're really confident, for example, that someone will click on an ad, you probably want to serve one that pays a nice click through rate. If you have some kind of gambling engine, you want to bet more money on the predictions you are more confident of. Or if you're diagnosing an illness in a patient, it would be awfully nice to be able to tell the patient how certain you are of the diagnosis and what the confidence in the prognosis is. There are various ad hoc ways that people do this sort of thing.