Bayesian Learning
Instagram Fake and Automated Account Detection
Akyon, Fatih Cagatay, Kalfaoglu, Esat
Fake engagement is one of the significant problems in Online Social Networks (OSNs) which is used to increase the popularity of an account in an inorganic manner. The detection of fake engagement is crucial because it leads to loss of money for businesses, wrong audience targeting in advertising, wrong product predictions systems, and unhealthy social network environment. This study is related with the detection of fake and automated accounts which leads to fake engagement on Instagram. As far as we know, there is no publicly available dataset for fake and automated accounts. For this purpose, two datasets have been generated for the detection of fake and automated accounts. For the detection of these accounts, machine learning algorithms like Naive Bayes, logistic regression, support vector machines and neural networks are applied. Additionally, for the detection of automated accounts, cost sensitive genetic algorithm is applied because of the unnatural bias in the dataset. To deal with the unevenness problem in the fake dataset, Smote-nc algorithm is implemented. For the automated and fake account detection problem, 86% and 96% are obtained, respectively.
d-blink: Distributed End-to-End Bayesian Entity Resolution
Marchant, Neil G., Steorts, Rebecca C., Kaplan, Andee, Rubinstein, Benjamin I. P., Elazar, Daniel N.
Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while providing exact uncertainty quantification and tight performance bounds. Despite these advancements, existing models do not scale to realistically-sized databases (larger than 1000 records) and they do not incorporate probabilistic blocking. In this paper, we propose "distributed Bayesian linkage" or d-blink -- the first scalable and distributed end-to-end Bayesian model for ER, which propagates uncertainty in blocking, matching and merging. We make several novel contributions, including: (i) incorporating probabilistic blocking directly into the model through auxiliary partitions; (ii) support for missing values; (iii) a partially-collapsed Gibbs sampler; and (iv) a novel perturbation sampling algorithm (leveraging the Vose-Alias method) that enables fast updates of the entity attributes. Finally, we conduct experiments on five data sets which show that d-blink can achieve significant efficiency gains -- in excess of 300$\times$ -- when compared to existing non-distributed methods.
Population-aware Hierarchical Bayesian Domain Adaptation via Multiple-component Invariant Learning
Mhasawade, Vishwali, Rehman, Nabeel Abdur, Chunara, Rumi
While machine learning is rapidly being developed and deployed in health settings such as influenza prediction, there are critical challenges in using data from one environment in another due to variability in features; even within disease labels there can be differences (e.g. "fever" may mean something different reported in a doctor's office versus in an online app). Moreover, models are often built on passive, observational data which contain different distributions of population subgroups (e.g. men or women). Thus, there are two forms of instability between environments in this observational transport problem. We first harness knowledge from health to conceptualize the underlying causal structure of this problem in a health outcome prediction task. Based on sources of stability in the model, we posit that for human-sourced data and health prediction tasks we can combine environment and population information in a novel population-aware hierarchical Bayesian domain adaptation framework that harnesses multiple invariant components through population attributes when needed. We study the conditions under which invariant learning fails, leading to reliance on the environment-specific attributes. Experimental results for an influenza prediction task on four datasets gathered from different contexts show the model can improve prediction in the case of largely unlabelled target data from a new environment and different constituent population, by harnessing both environment and population invariant information. This work represents a novel, principled way to address a critical challenge by blending domain (health) knowledge and algorithmic innovation. The proposed approach will have a significant impact in many social settings wherein who and where the data comes from matters.
A Gentle Introduction to Uncertainty in Machine Learning
Applied machine learning requires managing uncertainty. There are many sources of uncertainty in a machine learning project, including variance in the specific data values, the sample of data collected from the domain, and in the imperfect nature of any models developed from such data. Managing the uncertainty that is inherent in machine learning for predictive modeling can be achieved via the tools and techniques from probability, a field specifically designed to handle uncertainty. In this post, you will discover the challenge of uncertainty in machine learning. A Gentle Introduction to Uncertainty in Machine Learning Photo by Anastasiy Safari, some rights reserved.
Unsupervised Recalibration
Ziegler, Albert, Czyลผ, Paweล
Unsupervised recalibration (URC) is a general way to improve the accuracy of an already trained probabilistic classification or regression model upon encountering new data while deployed in the field. URC does not require any ground truth associated with the new field data. URC merely observes the model's predictions and recognizes when the training set is not representative of field data, and then corrects to remove any introduced bias. URC can be particularly useful when applied separately to different subpopulations observed in the field that were not considered as features when training the machine learning model. This makes it possible to exploit subpopulation information without retraining the model or even having ground truth for some or all subpopulations available. Additionally, if these subpopulations are the object of study, URC serves to determine the correct ground truth distributions for them, where naive aggregation methods, like averaging the model's predictions, systematically underestimate their differences.
A Note on Posterior Probability Estimation for Classifiers
Nalbantov, Georgi, Ivanov, Svetoslav
One of the central themes in the classification task is the estimation of class posterior probability at a new point $\bf{x}$. The vast majority of classifiers output a score for $\bf{x}$, which is monotonically related to the posterior probability via an unknown relationship. There are many attempts in the literature to estimate this latter relationship. Here, we provide a way to estimate the posterior probability without resorting to using classification scores. Instead, we vary the prior probabilities of classes in order to derive the ratio of pdf's at point $\bf{x}$, which is directly used to determine class posterior probabilities. We consider here the binary classification problem.
Regularized Estimation and Feature Selection in Mixtures of Gaussian-Gated Experts Models
Chamroukhi, Faรฏcel, Lecocq, Florian, Nguyen, Hien D.
Mixtures-of-Experts models and their maximum likelihood estimation (MLE) via the EM algorithm have been thoroughly studied in the statistics and machine learning literature. They are subject of a growing investigation in the context of modeling with high-dimensional predictors with regularized MLE. We examine MoE with Gaussian gating network, for clustering and regression, and propose an $\ell_1$-regularized MLE to encourage sparse models and deal with the high-dimensional setting. We develop an EM-Lasso algorithm to perform parameter estimation and utilize a BIC-like criterion to select the model parameters, including the sparsity tuning hyperparameters. Experiments conducted on simulated data show the good performance of the proposed regularized MLE compared to the standard MLE with the EM algorithm.
Modular Meta-Learning with Shrinkage
Chen, Yutian, Friesen, Abram L., Behbahani, Feryal, Budden, David, Hoffman, Matthew W., Doucet, Arnaud, de Freitas, Nando
Most gradient-based approaches to meta-learning do not explicitly account for the fact that different parts of the underlying model adapt by different amounts when applied to a new task. For example, the input layers of an image classification convnet typically adapt very little, while the output layers can change significantly. This can cause parts of the model to begin to overfit while others underfit. To address this, we introduce a hierarchical Bayesian model with per-module shrinkage parameters, which we propose to learn by maximizing an approximation of the predictive likelihood using implicit differentiation. Our algorithm subsumes Reptile and outperforms variants of MAML on two synthetic few-shot meta-learning problems.
Maximum Likelihood Constraint Inference for Inverse Reinforcement Learning
Scobee, Dexter R. R., Sastry, S. Shankar
While most approaches to the problem of Inverse Reinforcement Learning (IRL) focus on estimating a reward function that best explains an expert agent's policy or demonstrated behavior on a control task, it is often the case that such behavior is more succinctly described by a simple reward combined with a set of hard constraints. In this setting, the agent is attempting to maximize cumulative rewards subject to these given constraints on their behavior. We reformulate the problem of IRL on Markov Decision Processes (MDPs) such that, given a nominal model of the environment and a nominal reward function, we seek to estimate state, action, and feature constraints in the environment that motivate an agent's behavior. Our approach is based on the Maximum Entropy IRL framework, which allows us to reason about the likelihood of an expert agent's demonstrations given our knowledge of an MDP. Using our method, we can infer which constraints can be added to the MDP to most increase the likelihood of observing these demonstrations. We present an algorithm which iteratively infers the Maximum Likelihood Constraint to best explain observed behavior, and we evaluate its efficacy using both simulated behavior and recorded data of humans navigating around an obstacle.
Machine Learning Basics
Before we start this article on machine learning basics, let us take an example to understand the impact of machine learning in the world. We can safely assume that machine learning has been a dominant force in today's world and has accelerated our progress in all fields. No matter which industry you look at, machine learning has dramatically altered it. Let's take an example from the world of trading. Man Group's AHL Dimension programme is a $5.1 billion dollar hedge fund which is partially managed by AI. After it started off, by the year 2015, its machine learning algorithms were contributing more than half of the profits of the fund even though the assets under its management were far less. Machine learning has become a hot topic today, with professionals all over the world signing up for ML or AI courses for fear of being left behind. But exactly what is machine learning? It will be clear to you when you have reached the end of this article. Machine Learning, as the name suggests, provides machines with the ability to learn autonomously based on experiences, observations and analysing patterns within a given data set without explicitly programming. When we write a program or a code for some specific purpose, we are actually writing a definite set of instructions which the machine will follow. Whereas in machine learning, we input a data set through which the machine will learn by identifying and analysing the patterns in the data set and learn to take decisions autonomously based on its observations and learnings from the dataset.