Directed Networks
Iterative Bayesian Learning for Crowdsourced Regression
Ok, Jungseul, Oh, Sewoong, Jang, Yunhun, Shin, Jinwoo, Yi, Yung
Crowdsourcing platforms have emerged as popular venues for purchasing human intelligence at low cost for large volumes of tasks. As many low-paid workers are prone to give noisy answers, one of the fundamental questions is how to identify more reliable workers and exploit this heterogeneity to infer the true answers accurately. Despite significant research efforts for classification tasks with discrete answers, little attention has been paid to regression tasks with continuous answers. The popular Dawid-Skene model for discrete answers is algorithmically and mathematically simple because of its relation to low-rank structures, but it does not generalize to continuous-valued answers. To this end, we introduce a new probabilistic model for crowdsourced regression capturing the heterogeneity of the workers, generalizing the Dawid-Skene model to the continuous domain. We design a message-passing algorithm for Bayesian inference inspired by the popular belief propagation algorithm. We showcase its performance first by proving that it achieves a near-optimal mean squared error by comparing it to an oracle estimator. Asymptotically, we can provide a tighter analysis showing that the proposed algorithm achieves the exact optimal performance. We next show synthetic experiments confirming our theoretical predictions. As a practical application, we further emulate a crowdsourcing system reproducing PASCAL visual object classes datasets and show that de-noising the crowdsourced data with the proposed scheme can significantly improve the performance for the vision task.
Christopher Fonnesbeck - Introduction to Statistical Modeling with Python - PyCon 2017
Speaker: Christopher Fonnesbeck. This intermediate-level tutorial will provide students with hands-on experience applying practical statistical modeling methods on real data. Unlike many introductory statistics courses, we will not be applying "cookbook" methods that are easy to teach, but often inapplicable; instead, we will learn some foundational statistical methods that can be applied generally to a wide variety of problems: maximum likelihood, bootstrapping, linear regression, and other modern techniques. The tutorial will start with a short introduction on data manipulation and cleaning using [pandas](http://pandas.pydata.org/). Slightly more advanced topics include bootstrapping (for estimating uncertainty around estimates) and flexible linear regression methods using Bayesian methods. By using and modifying hand-coded implementations of these techniques, students will gain an understanding of how each method works.
Stopword removal (surprisingly) decreases accuracy of naive-bayes model
Stop word lists typically remove such things as "a, an, the, it". Often this can be beneficial when we are classifying based on topics, which are well described by nouns and adjectives. However, some text classification tasks are more abstract. Consider classifying fiction and non-fiction articles on the same topic: what would the difference between these two writing styles be? They would probably use the same nouns, but what about the frequency of "the" vs "an" or "he" vs "they"?
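The point above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the four toy sentences, the tiny `STOP_WORDS` set, and the hand-rolled multinomial naive Bayes with add-one smoothing are not from the original post. The two classes use identical content words and differ only in function words, so removing stop words leaves the classifier with nothing to separate them.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "it", "he", "they"}


def tokenize(text, remove_stop_words):
    """Lowercase, split on whitespace, optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def train_naive_bayes(docs, labels, remove_stop_words):
    """Collect per-class word counts, per-class document counts, and the vocabulary."""
    counts = {label: Counter() for label in set(labels)}
    doc_counts = Counter(labels)
    for text, label in zip(docs, labels):
        counts[label].update(tokenize(text, remove_stop_words))
    vocab = set()
    for word_counts in counts.values():
        vocab.update(word_counts)
    return counts, doc_counts, vocab


def class_scores(model, text, remove_stop_words):
    """Log-posterior (up to a constant) per class, with add-one smoothing."""
    counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, word_counts in counts.items():
        total_words = sum(word_counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text, remove_stop_words):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores


# Toy corpus: same nouns and verbs in both classes, different function words.
docs = ["he saw the dragon", "he fed the dragon",
        "they saw a dragon", "they fed a dragon"]
labels = ["fiction", "fiction", "nonfiction", "nonfiction"]
query = "he saw the dragon"

model_with = train_naive_bayes(docs, labels, remove_stop_words=False)
model_without = train_naive_bayes(docs, labels, remove_stop_words=True)
scores_with = class_scores(model_with, query, remove_stop_words=False)
scores_without = class_scores(model_without, query, remove_stop_words=True)
print("stop words kept:   ", scores_with)
print("stop words removed:", scores_without)
```

With stop words kept, the "fiction" score is higher for the query; with them removed, the two classes receive the same score because the remaining tokens have identical counts in both classes. Real corpora are far messier, so this only shows the mechanism, not the size of the effect.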
Approximate Inference with Amortised MCMC
Li, Yingzhen, Turner, Richard E., Liu, Qiang
We propose a novel approximate inference framework that approximates a target distribution by amortising the dynamics of a user-selected Markov chain Monte Carlo (MCMC) sampler. The idea is to initialise MCMC using samples from an approximation network, apply the MCMC operator to improve these samples, and finally use the samples to update the approximation network thereby improving its quality. This provides a new generic framework for approximate inference, allowing us to deploy highly complex, or implicitly defined approximation families with intractable densities, including approximations produced by warping a source of randomness through a deep neural network. Experiments consider Bayesian neural network classification and image modelling with deep generative models. Deep models trained using amortised MCMC are shown to generate realistic looking samples as well as producing diverse imputations for images with regions of missing pixels.
The 10 Algorithms Machine Learning Engineers Need to Know
There is no doubt that the sub-field of machine learning / artificial intelligence has gained popularity over the past couple of years. As Big Data is the hottest trend in the tech industry at the moment, machine learning is incredibly powerful for making predictions or calculated suggestions based on large amounts of data. Some of the most common examples of machine learning are Netflix's algorithms to make movie suggestions based on movies you have watched in the past or Amazon's algorithms that recommend books based on books you have bought before. So if you want to learn more about machine learning, how do you start? For me, my first introduction was an Artificial Intelligence class I took while studying abroad in Copenhagen. My lecturer was a full-time Applied Math and CS professor at the Technical University of Denmark, whose research areas are logic and artificial intelligence, focusing primarily on the use of logic to model human-like planning, reasoning and problem solving.
Bayesian Machine Learning, Explained
So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - we know it took us a while. This article is an introduction we wish we had back then. While we have some grasp on the matter, we're not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.
DoorDash sees 25% lift from AI recommendations
Food delivery company DoorDash says personalized restaurant recommendations based on AI are seeing a significant lift in orders, compared to regular recommendations based on popularity. In an interview with VentureBeat, DoorDash product manager Jimmy Liu said customers who saw personalized recommendations on average "were over 25 percent more likely" to place an order versus people who saw the most popular restaurants in their area. We talked with Liu on the eve of the company's announcement today that it's rolling out these machine-learning based recommendations to all of its users, after testing it on increasing percentages of its customer base. Millions of users have already seen the recommendations, the company said. Liu said the 25 percent lift from recommendations came specifically from email campaigns.
The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables
Ambrogioni, Luca, Güçlü, Umut, van Gerven, Marcel A. J., Maris, Eric
This paper introduces the kernel mixture network, a new method for nonparametric estimation of conditional probability densities using neural networks. We model arbitrarily complex conditional densities as linear combinations of a family of kernel functions centered at a subset of training points. The weights are determined by the outer layer of a deep neural network, trained by minimizing the negative log likelihood. This generalizes the popular quantized softmax approach, which can be seen as a kernel mixture network with square and non-overlapping kernels. We test the performance of our method on two important applications, namely Bayesian filtering and generative modeling. In the Bayesian filtering example, we show that the method can be used to filter complex nonlinear and non-Gaussian signals defined on manifolds. The resulting kernel mixture network filter outperforms both the quantized softmax filter and the extended Kalman filter in terms of model likelihood. Finally, our experiments on generative models show that, given the same architecture, the kernel mixture network leads to higher test set likelihood, less overfitting and more diversified and realistic generated samples than the quantized softmax approach.
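The density form described in the abstract, a weighted combination of kernels centered at selected points with weights produced by a softmax output layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the fixed `bandwidth`, the example `centers`, and the hard-coded `logits` (standing in for the output of a neural network evaluated at some input) are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax; returns non-negative weights summing to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_kernel(y, center, bandwidth):
    """Normalized Gaussian kernel centered at `center` (an assumed kernel choice)."""
    z = (y - center) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))


def mixture_density(y, logits, centers, bandwidth):
    """p(y | x) = sum_k w_k(x) * K(y, c_k), with w(x) = softmax(logits(x))."""
    weights = softmax(logits)
    return sum(w * gaussian_kernel(y, c, bandwidth)
               for w, c in zip(weights, centers))


def negative_log_likelihood(ys, logits_per_sample, centers, bandwidth):
    """Average negative log likelihood, the training objective named in the abstract."""
    total = 0.0
    for y, logits in zip(ys, logits_per_sample):
        total -= math.log(mixture_density(y, logits, centers, bandwidth))
    return total / len(ys)


# Hypothetical values: in a real model the centers would be a subset of the
# training targets and the logits would come from the network's last layer.
centers = [-1.0, 0.0, 2.0]
logits = [0.1, 1.5, -0.3]
bandwidth = 0.5

density_at_zero = mixture_density(0.0, logits, centers, bandwidth)
nll = negative_log_likelihood([0.0, 1.8], [logits, logits], centers, bandwidth)
print(density_at_zero, nll)
```

Because the weights sum to one and each kernel integrates to one, the mixture is itself a valid density for any logits, which is what makes minimizing the negative log likelihood a well-posed objective.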
CDS Rate Construction Methods by Machine Learning Techniques
Brummelhuis, Raymond, Luo, Zhongmin
Regulators require financial institutions to estimate counterparty default risks from liquid CDS quotes for the valuation and risk management of OTC derivatives. However, the vast majority of counterparties do not have liquid CDS quotes and need proxy CDS rates. Existing methods cannot account for counterparty-specific default risks; we propose to construct proxy CDS rates by associating each illiquid counterparty with a liquid CDS proxy using machine learning techniques. After testing 156 classifiers from the 8 most popular classifier families, we found that some classifiers achieve highly satisfactory accuracy rates. Furthermore, we have rank-ordered the performances and investigated performance variations amongst and within the 8 classifier families. This paper is, to the best of our knowledge, the first systematic study of CDS proxy construction by machine learning techniques, and the first systematic classifier comparison study based entirely on financial market data. Its findings both confirm and contrast with the existing classifier performance literature. Given the typically highly correlated nature of financial data, we investigated the impact of correlation on classifier performance. The techniques used in this paper should be of interest for financial institutions seeking a CDS proxy method, and can serve for proxy construction for other financial variables. Some directions for future research are indicated.
Data Science Dictionary
The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate the parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. This process is repeated N times, with each of the N subsets being used once as the validation set. The error values obtained in these N steps are then combined to provide the final estimate of the model error. Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks, and classification and regression trees (CART). The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.
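The procedure described above can be sketched in a few lines of standard-library Python. The model (a one-variable least-squares line), the mean squared error as the error measure, the averaging of the N fold errors, and the synthetic data are all assumptions chosen for illustration; the dictionary entry does not prescribe any of them.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into n_folds subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(xs, ys, n_folds=5, seed=0):
    """Return the mean held-out squared error and the list of per-fold errors."""
    folds = k_fold_indices(len(xs), n_folds, seed)
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        # Estimate parameters from the remaining N-1 subsets.
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Estimate the error on the subset that was put aside.
        mse = mean((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
        fold_errors.append(mse)
    # Combine the N error values into one estimate (here: a simple average).
    return mean(fold_errors), fold_errors


# Synthetic data for illustration only: a noisy line.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

cv_error, per_fold = cross_validate(xs, ys, n_folds=5)
print("per-fold MSE:", per_fold)
print("cross-validated MSE:", cv_error)
```

Every observation is used for validation exactly once and for fitting N-1 times, which is how the method extracts an error estimate from scarce data without holding out a large fixed test set.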