Regression
The effectiveness of factorization and similarity blending
Pinto, Andrea, Camposampiero, Giacomo, Houmard, Loïc, Lundwall, Marc
Collaborative Filtering (CF) is a widely used technique which allows to leverage past users' preferences data to identify behavioural patterns and exploit them to predict custom recommendations. In this work, we illustrate our review of different CF techniques in the context of the Computational Intelligence Lab (CIL) CF project at ETH Z\"urich. After evaluating the performances of the individual models, we show that blending factorization-based and similarity-based approaches can lead to a significant error decrease (-9.4%) on the best-performing stand-alone model. Moreover, we propose a novel stochastic extension of a similarity model, SCSR, which consistently reduce the asymptotic complexity of the original algorithm.
Truthful Generalized Linear Models
Qiu, Yuan, Liu, Jinyan, Wang, Di
In this paper we study estimating Generalized Linear Models (GLMs) in the case where the agents (individuals) are strategic or self-interested and they concern about their privacy when reporting data. Compared with the classical setting, here we aim to design mechanisms that can both incentivize most agents to truthfully report their data and preserve the privacy of individuals' reports, while their outputs should also close to the underlying parameter. In the first part of the paper, we consider the case where the covariates are sub-Gaussian and the responses are heavy-tailed where they only have the finite fourth moments. First, motivated by the stationary condition of the maximizer of the likelihood function, we derive a novel private and closed form estimator. Based on the estimator, we propose a mechanism which has the following properties via some appropriate design of the computation and payment scheme for several canonical models such as linear regression, logistic regression and Poisson regression: (1) the mechanism is $o(1)$-jointly differentially private (with probability at least $1-o(1)$); (2) it is an $o(\frac{1}{n})$-approximate Bayes Nash equilibrium for a $(1-o(1))$-fraction of agents to truthfully report their data, where $n$ is the number of agents; (3) the output could achieve an error of $o(1)$ to the underlying parameter; (4) it is individually rational for a $(1-o(1))$ fraction of agents in the mechanism ; (5) the payment budget required from the analyst to run the mechanism is $o(1)$. In the second part, we consider the linear regression model under more general setting where both covariates and responses are heavy-tailed and only have finite fourth moments. By using an $\ell_4$-norm shrinkage operator, we propose a private estimator and payment scheme which have similar properties as in the sub-Gaussian case.
INTRODUCTION TO SUPERVISED LEARNING
Machine learning is a set of tools that, broadly speaking, allow us to "teach" computers how to perform tasks by providing examples of how they should be done. For example, suppose we wish to write a program to distinguish between valid email messages and unwanted spam. We could try to write a set of simple rules, for example, flagging messages that contain certain features (such as the word "viagra" or obviously-fake headers). However, writing rules to accurately distinguish which text is valid can actually be quite difficult to do well, resulting either in many missed spam messages, or, worse, many lost emails. Worse, the spammers will actively adjust the way they send spam in order to trick these strategies (e.g., writing "vi@gr@"). Writing effective rules -- and keeping them up to date -- quickly becomes an insurmountable task. Fortunately, machine learning has provided a solution. Modern spam filters are "learned" from Examples: we provide the learning algorithm with example emails which we have manually labelled as "ham" (valid email) or "spam" (unwanted email), and the algorithms learn to distinguish between them automatically.
How to Convince Your Boss to Trust Your ML/DL Models
Some company managers or stakeholders are pessimistic about machine learning model predictions. Therefore, it is data scientists' reasonability to convince them that the model prediction is credible and also understandable to humans. Therefore, we need to focus not only on creating powerful machine learning/deep learning models, but also make the models interpretable by humans. Interpretability helps in many ways, such as helping us to understand how a model makes a decision, it justifies model prediction and gaining insights, building trust in the model, and it helps us improve the model. There are two types of ML model interpretation -- global and local. Good Examples of inherently explainable models are linear regression and decision trees.
Pyspark MLlib
Originally published on Towards AI the World's Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses. In the previous sections, we discussed about RDD, Dataframes, and Pyspark concepts.
Learning Value-at-Risk and Expected Shortfall
Barrera, D, Crépey, S, Gobet, E, Nguyen, Hoang-Dung, Saadeddine, B
We propose a non-asymptotic convergence analysis of a two-step approach to learn a conditional value-at-risk (VaR) and expected shortfall (ES) in a nonparametric setting using Rademacher and Vapnik-Chervonenkis bounds. Our approach for the VaR is extended to the problem of learning at once multiple VaRs corresponding to different quantile levels. This results in efficient learning schemes based on neural network quantile and least-squares regressions. An a posteriori Monte Carlo (non-nested) procedure is introduced to estimate distances to the ground-truth VaR and ES without access to the latter. This is illustrated using numerical experiments in a Gaussian toy-model and a financial case-study where the objective is to learn a dynamic initial margin.
Data Science Approach to predict the winning Fantasy Cricket Team Dream 11 Fantasy Sports
S, Sachin Kumar, HV, Prithvi, Nandini, C
The evolution of digital technology and the increasing popularity of sports inspired the innovators to take the experience of users with a proclivity towards sports to a whole new different level, by introducing Fantasy Sports Platforms FSPs. The application of Data Science and Analytics is Ubiquitous in the Modern World. Data Science and Analytics open doors to gain a deeper understanding and help in the decision making process. We firmly believed that we could adopt Data Science to predict the winning fantasy cricket team on the FSP, Dream 11. We built a predictive model that predicts the performance of players in a prospective game. We used a combination of Greedy and Knapsack Algorithms to prescribe the combination of 11 players to create a fantasy cricket team that has the most significant statistical odds of finishing as the strongest team thereby giving us a higher chance of winning the pot of bets on the Dream 11 FSP. We used PyCaret Python Library to help us understand and adopt the best Regressor Algorithm for our problem statement to make precise predictions. Further, we used Plotly Python Library to give us visual insights into the team, and players performances by accounting for the statistical, and subjective factors of a prospective game. The interactive plots help us to bolster the recommendations of our predictive model. You either win big, win small, or lose your bet based on the performance of the players selected for your fantasy team in the prospective game, and our model increases the probability of you winning big.
Prediction Intervals and Confidence Regions for Symbolic Regression Models based on Likelihood Profiles
de Franca, Fabricio Olivetti, Kronberger, Gabriel
Symbolic regression is a nonlinear regression method which is commonly performed by an evolutionary computation method such as genetic programming. Quantification of uncertainty of regression models is important for the interpretation of models and for decision making. The linear approximation and so-called likelihood profiles are well-known possibilities for the calculation of confidence and prediction intervals for nonlinear regression models. These simple and effective techniques have been completely ignored so far in the genetic programming literature. In this work we describe the calculation of likelihood profiles in details and also provide some illustrative examples with models created with three different symbolic regression algorithms on two different datasets. The examples highlight the importance of the likelihood profiles to understand the limitations of symbolic regression models and to help the user taking an informed post-prediction decision.
Stochastic Tree Ensembles for Estimating Heterogeneous Effects
Krantsevich, Nikolay, He, Jingyu, Hahn, P. Richard
Determining subgroups that respond especially well (or poorly) to specific interventions (medical or policy) requires new supervised learning methods tailored specifically for causal inference. Bayesian Causal Forest (BCF) is a recent method that has been documented to perform well on data generating processes with strong confounding of the sort that is plausible in many applications. This paper develops a novel algorithm for fitting the BCF model, which is more efficient than the previously available Gibbs sampler. The new algorithm can be used to initialize independent chains of the existing Gibbs sampler leading to better posterior exploration and coverage of the associated interval estimates in simulation studies. The new algorithm is compared to related approaches via simulation studies as well as an empirical analysis.