 Regression


A Beginner's Guide to Sentiment Analysis with Python

#artificialintelligence

Sentiment analysis is a technique that detects the underlying sentiment in a piece of text. It is the process of classifying text as either positive, negative, or neutral. Machine learning techniques are used to evaluate a piece of text and determine the sentiment behind it. Sentiment analysis is essential for businesses to gauge customer response. Picture this: Your company has just released a new product that is being advertised on a number of different channels.
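To make the positive/negative/neutral classification concrete, here is a minimal sketch. It counts hits against a tiny hand-written word list instead of using a trained machine learning model, so it only illustrates the shape of the task. The word sets and the function name `classify_sentiment` are assumptions made for this example and do not come from the article.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# a real system would learn word weights from labelled data.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "awful"}


def classify_sentiment(text: str) -> str:
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ("I love this great product",
               "Terrible, awful battery",
               "The box arrived on Tuesday"):
    print(review, "->", classify_sentiment(review))
```

A lexicon count like this ignores negation and context ("not good" scores as positive), which is one reason learned classifiers are normally preferred.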


Semi-supervised learning and the question of true versus estimated propensity scores

arXiv.org Machine Learning

A straightforward application of semi-supervised machine learning to the problem of treatment effect estimation would be to consider data as "unlabeled" if treatment assignment and covariates are observed but outcomes are unobserved. According to this formulation, large unlabeled data sets could be used to estimate a high dimensional propensity function, and causal inference using a much smaller labeled data set could proceed via weighted estimators using the learned propensity scores. In the limiting case of infinite unlabeled data, one may estimate the high dimensional propensity function exactly. However, longstanding advice in the causal inference community suggests that estimated propensity scores (from labeled data alone) are actually preferable to true propensity scores, implying that the unlabeled data is actually useless in this context. In this paper we examine this paradox and propose a simple procedure that reconciles the strong intuition that a known propensity function should be useful for estimating treatment effects with the previous literature suggesting otherwise. Further, simulation studies suggest that direct regression may be preferable to inverse-propensity weighted estimators in many circumstances.
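The contrast between true and estimated propensity scores can be sketched with a small simulation. This is not the paper's procedure: it is a plain inverse-propensity weighted estimate of the average treatment effect on synthetic data with one binary covariate, where the data-generating process (true effect 2.0, propensities 0.3 and 0.7), the sample size, and all function names are assumptions chosen for illustration.

```python
import random


def simulate(n, seed=0):
    """Draw (x, t, y) rows; the true average treatment effect is 2.0 by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)                   # binary covariate
        t = int(rng.random() < (0.7 if x else 0.3))   # treatment depends on x
        y = 1.0 + 2.0 * t + 3.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def ipw_ate(rows, propensity):
    """Inverse-propensity weighted ATE; `propensity` maps x to P(t = 1 | x)."""
    total = 0.0
    for x, t, y in rows:
        e = propensity[x]
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(rows)


def estimated_propensity(rows):
    """Estimate P(t = 1 | x) as the treated fraction within each level of x."""
    est = {}
    for level in (0, 1):
        treated = [t for x, t, _ in rows if x == level]
        est[level] = sum(treated) / len(treated)
    return est


rows = simulate(20000)
ate_true = ipw_ate(rows, {0: 0.3, 1: 0.7})            # weights from the known propensity
ate_est = ipw_ate(rows, estimated_propensity(rows))   # weights re-estimated from the sample
print(f"IPW with true propensity:      {ate_true:.3f}")
print(f"IPW with estimated propensity: {ate_est:.3f}")
```

In this saturated setting, weighting by the within-stratum treated fraction is algebraically the same as averaging the within-stratum differences in mean outcome, so the estimated-propensity version coincides with a simple stratified regression estimate.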


Matching in Selective and Balanced Representation Space for Treatment Effects Estimation

arXiv.org Machine Learning

Observational data is becoming dramatically more available across many domains of science and technology, which facilitates the study of causal inference. However, estimating treatment effects from observational data faces two major challenges: missing counterfactual outcomes and treatment selection bias. Matching methods are among the most widely used and fundamental approaches to estimating treatment effects, but existing matching methods perform poorly on data with high-dimensional and complicated variables. We propose a feature selection representation matching (FSRM) method based on deep representation learning and matching, which maps the original covariate space into a selective, nonlinear, and balanced representation space, and then conducts matching in the learned representation space. FSRM adopts deep feature selection to minimize the influence of irrelevant variables for estimating treatment effects and incorporates a regularizer based on the Wasserstein distance to learn balanced representations. We evaluate the performance of our FSRM method on three datasets, and the results demonstrate superiority over the state-of-the-art methods.


A Vertical Federated Learning Method for Interpretable Scorecard and Its Application in Credit Scoring

arXiv.org Machine Learning

With the success of big data and artificial intelligence in many fields, big-data-driven models are expected to find applications in financial risk management, especially credit scoring and rating. Under the premise of data privacy protection, we propose a projected gradient-based method in the vertical federated learning framework for the traditional scorecard, which is based on logistic regression with bounded constraints, namely FL-LRBC. FL-LRBC enables multiple agencies to jointly train an optimized scorecard model in a single training session. It yields a model with positive coefficients, while the time-consuming parameter-tuning process can be avoided. Moreover, performance in terms of both AUC and the Kolmogorov-Smirnov (KS) statistic is significantly improved due to data enrichment using FL-LRBC. At present, FL-LRBC has already been applied to credit business in a nationwide financial holdings group in China.
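The core idea of logistic regression with bounded constraints fitted by projected gradient can be sketched on a single machine. This is not FL-LRBC itself: the federated, multi-party, and privacy-preserving parts are omitted entirely, and the function name `fit_bounded_logistic`, the learning rate, the step count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_bounded_logistic(X, y, lower=0.0, upper=np.inf, lr=0.1, steps=2000):
    """Logistic regression by gradient descent, projecting weights onto [lower, upper]."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / n               # gradient of the mean log-loss
        grad_b = np.mean(p - y)
        # Projection step: clip each weight back into the allowed box.
        w = np.clip(w - lr * grad_w, lower, upper)
        b -= lr * grad_b                         # the intercept is left unconstrained
    return w, b


# Synthetic data (assumed): the third feature has a negative true effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.5, 0.5, -1.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)

w, b = fit_bounded_logistic(X, y)   # lower=0.0 enforces non-negative coefficients
print("constrained weights:", w)
```

Clipping after every gradient step is the Euclidean projection onto a box, which is why a constraint such as "all coefficients positive" needs no separate tuning pass in this sketch.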


To Bag is to Prune

arXiv.org Machine Learning

It is notoriously hard to build a bad Random Forest (RF). Concurrently, RF is perhaps the only standard ML algorithm that blatantly overfits in-sample without any consequence out-of-sample. Standard arguments cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a (latent) true underlying tree. More generally, there is no need to tune the stopping point of a properly randomized ensemble of greedily optimized base learners. Thus, Boosting and MARS are eligible for automatic (implicit) tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles yield an out-of-sample performance equivalent to that of their tuned counterparts -- or better.


Semiparametric Estimation and Inference on Structural Target Functions using Machine Learning and Influence Functions

arXiv.org Machine Learning

We aim to construct a class of learning algorithms that are of practical value to applied researchers in fields such as biostatistics, epidemiology and econometrics, where the need to learn from incompletely observed information is ubiquitous. To do so, we propose a new framework for statistical machine learning, which we call 'IF-learning' due to its reliance on influence functions (IFs). To characterise the fundamental limits of what is achievable within this framework, we need to enable semiparametric estimation and inference on structural target parameters that are functions of continuous inputs arising as identifiable functionals from statistical models. Therefore, we introduce a pointwise IF to replace the true IF when it does not exist and propose learning its uncentered pointwise expected value from data. This allows us to give provable guarantees, leveraging existing general results from statistics. Our framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics: we can consider any target function for which an IF of a population-averaged version exists in analytic form. Throughout, we put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information. This includes problems such as treatment effect estimation and inference in the presence of missing outcome data. Within this framework, we then propose two general learning algorithms that leverage ideas from the theoretical analysis: the 'IF-learner' which relies on large samples and outputs entire target functions without confidence bands, and the 'Group-IF-learner', which outputs only approximations to a function but can give confidence estimates if sufficient information on coarsening mechanisms is available. We close with a simulation study on inferring treatment effects.


Regression modelling with I-priors

arXiv.org Machine Learning

We introduce the I-prior methodology as a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, longitudinal models, and models with functional covariates and responses. It can also be used for multi-class classification, with low or high dimensional covariates. The I-prior is generally defined as a maximum entropy prior. For a regression function, the I-prior is Gaussian with covariance kernel proportional to the Fisher information on the regression function, which is estimated by its posterior distribution under the I-prior. The I-prior has the intuitively appealing property that the more information is available on a linear functional of the regression function, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution. Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies model comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected. Using a number of widely analyzed real data sets we show that predictive performance of our methodology is competitive. An R-package implementing the methodology is available (Jamil, 2019).


Top 8 Data Mining Techniques In Machine Learning

#artificialintelligence

Data mining is one of the most popular topics in machine learning: it extracts meaningful information from large datasets and is used for decision-making tasks. It is a technique for identifying patterns in a pre-built database and is used quite extensively by organisations as well as academia. The various aspects of data mining include data cleaning, data integration, data transformation, data discretisation, pattern evaluation and more. Below, we have listed the top eight data mining techniques in machine learning that are most used by data scientists. Association Rule Learning is one of the unsupervised data mining techniques in which an item set is defined as a collection of one or more items.
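Association rules over item sets are usually scored by support (the share of transactions containing an item set) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both on a made-up basket dataset; the transactions and function names are assumptions for illustration, not taken from the article.

```python
# Hypothetical shopping baskets; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"bread", "milk"}, transactions))        # 2 of 4 baskets
print(confidence({"bread"}, {"milk"}, transactions))   # 2 of the 3 bread baskets
```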


Little-known Linear Regression Assumptions

#artificialintelligence

The model should conform to these assumptions to produce the best linear regression fit to the data. Linear regression is a method of modelling the best linear relationship between the independent variables and the dependent variable. The predictor variables are treated as fixed values and can enter through any complex function, such as a polynomial or trigonometric one, but the model must remain strictly linear in the coefficients. This assumption is what makes polynomial regression possible: it uses linear regression to fit the response variable as an arbitrary polynomial function of a predictor variable, while the relationship with the coefficients stays linear.
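A short sketch shows why polynomial regression is still linear regression: the predictor is expanded into the columns 1, x, x^2, and ordinary least squares is solved for the coefficients. The quadratic used to generate the data and the variable names are assumptions for illustration.

```python
import numpy as np

# Noiseless data from an assumed quadratic: y = 1 - 2x + 0.5x^2
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 - 2.0 * x + 0.5 * x**2

# Design matrix with columns [1, x, x^2]; nonlinear in x, linear in the coefficients.
X = np.vander(x, N=3, increasing=True)

# Ordinary least squares on the expanded features.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # intercept, linear and quadratic coefficients
```

Because the data are noiseless, the least-squares solution recovers the generating coefficients up to floating-point error.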


10 Machine Learning Algorithms You Need to Know

#artificialintelligence

If you've just started to explore the ways that machine learning can impact your business, the first questions you're likely to come across are what are all of the different types of machine learning algorithms, what are they good for, and which one should I choose for my project? This post will help you answer those questions. There are a few different ways to categorize machine learning algorithms. One way is based on what the training data looks like. Another way to classify algorithms--and one that's more practical from a business perspective--is to categorize them based on how they work and what kinds of problems they can solve, which is what we'll do here.