Goto

Collaborating Authors

 Regression


Predicting Gross Movie Revenue

arXiv.org Machine Learning

'There is no terror in the bang, only is the anticipation of it' - Alfred Hitchcock. Yet there is everything in correctly anticipating the bang a movie would make in the box-office. Movies make a high profile, billion dollar industry and prediction of movie revenue can be very lucrative. Predicted revenues can be used for planning both the production and distribution stages. For example, projected gross revenue can be used to plan the remuneration of the actors and crew members as well as other parts of the budget [1]. Success or failure of a movie can depend on many factors: star-power, release date, budget, MPAA (Motion Picture Association of America) rating, plot and the highly unpredictable human reactions. The enormity of the number of exogenous variables makes manual revenue prediction process extremely difficult. However, in the era of computer and data sciences, volumes of data can be efficiently processed and modelled. Hence the tough job of predicting gross revenue of a movie can be simplified with the help of modern computing power and the historical data available as movie databases [2].


Evaluating Hospital Case Cost Prediction Models Using Azure Machine Learning Studio

arXiv.org Machine Learning

Ability for accurate hospital case cost modelling and prediction is critical for efficient health care financial management and budgetary planning. A variety of regression machine learning algorithms are known to be effective for health care cost predictions. The purpose of this experiment was to build an Azure Machine Learning Studio tool for rapid assessment of multiple types of regression models. The tool offers environment for comparing 14 types of regression models in a unified experiment: linear regression, Bayesian linear regression, decision forest regression, boosted decision tree regression, neural network regression, Poisson regression, Gaussian processes for regression, gradient boosted machine, nonlinear least squares regression, projection pursuit regression, random forest regression, robust regression, robust regression with mm-type estimators, support vector regression. The tool presents assessment results arranged by model accuracy in a single table using five performance metrics. Evaluation of regression machine learning models for performing hospital case cost prediction demonstrated advantage of robust regression model, boosted decision tree regression and decision forest regression. The operational tool has been published to the web and openly available for experiments and extensions.


Hospital Readmission Prediction - Applying Hierarchical Sparsity Norms for Interpretable Models

arXiv.org Machine Learning

Hospital readmissions have become one of the key measures of healthcare quality. Preventable readmissions have been identified as one of the primary targets for reducing costs and improving healthcare delivery. However, most data driven studies for understanding readmissions have produced black box classification and predictive models with moderate performance, which precludes them from being used effectively within the decision support systems in the hospitals. In this paper we present an application of structured sparsity-inducing norms for predicting readmission risk for patients based on their disease history and demographics. Most existing studies have focused on hospital utilization, test results, etc., to assign a readmission label to each episode of hospitalization. However, we focus on assigning a readmission risk label to a patient based on their disease history. Our emphasis is on interpreting the models to improve the understanding of the readmission problem. To achieve this, we exploit the domain induced hierarchical structure available for the disease codes which are the features for the classification algorithm. We use a tree based sparsity-inducing regularization strategy that explicitly uses the domain hierarchy. The resulting model not only outperforms standard regularization procedures but is also highly sparse and interpretable. We analyze the model and identify several significant factors that have an effect on readmission risk. Some of these factors conform to existing beliefs, e.g., impact of surgical complications and infections during hospital stay. Other factors, such as the impact of mental disorder and substance abuse on readmission, provide empirical evidence for several pre-existing but unverified hypotheses. The analysis also reveals previously undiscovered connections such as the influence of socioeconomic factors like lack of housing and malnutrition.


Stochastic EM for Shuffled Linear Regression

arXiv.org Machine Learning

We consider the problem of inference in a linear regression model in which the relative ordering of the input features and output labels is not known. Such datasets naturally arise from experiments in which the samples are shuffled or permuted during the protocol. In this work, we propose a framework that treats the unknown permutation as a latent variable. We maximize the likelihood of observations using a stochastic expectation-maximization (EM) approach. We compare this to the dominant approach in the literature, which corresponds to hard EM in our framework. We show on synthetic data that the stochastic EM algorithm we develop has several advantages, including lower parameter error, less sensitivity to the choice of initialization, and significantly better performance on datasets that are only partially shuffled. We conclude by performing two experiments on real datasets that have been partially shuffled, in which we show that the stochastic EM algorithm can recover the weights with modest error.


Probabilistic Formulations of Regression with Mixed Guidance

arXiv.org Machine Learning

Regression problems assume every instance is annotated (labeled) with a real value, a form of annotation we call \emph{strong guidance}. In order for these annotations to be accurate, they must be the result of a precise experiment or measurement. However, in some cases additional \emph{weak guidance} might be given by imprecise measurements, a domain expert or even crowd sourcing. Current formulations of regression are unable to use both types of guidance. We propose a regression framework that can also incorporate weak guidance based on relative orderings, bounds, neighboring and similarity relations. Consider learning to predict ages from portrait images, these new types of guidance allow weaker forms of guidance such as stating a person is in their 20s or two people are similar in age. These types of annotations can be easier to generate than strong guidance. We introduce a probabilistic formulation for these forms of weak guidance and show that the resulting optimization problems are convex. Our experimental results show the benefits of these formulations on several data sets.


Data Science Simplified Part 4: Simple Linear Regression Models

@machinelearnbot

Linear regression models are not perfect. It tries to approximate the relationship between dependent and independent variables in a straight line. Some errors can be reduced. Some errors are inherent in the nature of the problem. These errors cannot be eliminated. They are called as an irreducible error, the noise term in the true relationship that cannot fundamentally be reduced by any model.


Machine Learning 001 Applications of Machine Learning

#artificialintelligence

Want to watch this again later? Sign in to add this video to a playlist. Report Need to report the video? Sign in to report inappropriate content. Report Need to report the video?


Collaborative targeted minimum loss inference from continuously indexed nuisance parameter estimators

arXiv.org Machine Learning

Suppose that we wish to infer the value of a statistical parameter at a law from which we sample independent observations. Suppose that this parameter is smooth and that we can define two variation-independent, infinite-dimensional features of the law, its so called Q- and G-components (comp.), such that if we estimate them consistently at a fast enough product of rates, then we can build a confidence interval (CI) with a given asymptotic level based on a plain targeted minimum loss estimator (TMLE). The estimators of the Q- and G-comp. would typically be by products of machine learning algorithms. We focus on the case that the machine learning algorithm for the G-comp. is fine-tuned by a real-valued parameter h. Then, a plain TMLE with an h chosen by cross-validation would typically not lend itself to the construction of a CI, because the selection of h would trade-off its empirical bias with something akin to the empirical variance of the estimator of the G-comp. as opposed to that of the TMLE. A collaborative TMLE (C-TMLE) might, however, succeed in achieving the relevant trade-off. We construct a C-TMLE and show that, under high-level empirical processes conditions, and if there exists an oracle h that makes a bulky remainder term asymptotically Gaussian, then the C-TMLE is asymptotically Gaussian hence amenable to building a CI provided that its asymptotic variance can be estimated too. We illustrate the construction and main result with the inference of the average treatment effect, where the Q-comp. consists in a marginal law and a conditional expectation, and the G-comp. is a propensity score (a conditional probability). We also conduct a multi-faceted simulation study to investigate the empirical properties of the collaborative TMLE when the G-comp. is estimated by the LASSO. Here, h is the bound on the l1-norm of the candidate coefficients.


Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (2): Recommendation as discrete choice

@machinelearnbot

In this continuation of "Hybrid content-based and collaborative filtering recommendations wi..." I will describe the application of the {ordinal} clm() function to test a new, hybrid content-based, collaborative filtering approach to recommender engines by fitting a class of ordinal logistic (aka ordered logit) models to ratings data from the MovieLens 100K dataset. All R code used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. This second part of the study relies on the R code in OrdinalRecommenders_3.R and presents the model training, cross-validation, and the analyses. But before we proceed to approach the recommendation problem from a viewpoint of discrete choice modeling, let me briefly remind you of the results of the feature engineering phase and explain what happens in OrdinalRecommenders_2.R which prepares the data frames that are passed to clm().


10 Machine Learning Algorithms You Should Know to Become a Data Scientist - DZone AI

#artificialintelligence

Let's say I am given an Excel sheet with data about various fruits and I have to tell which look like Apples. What I will do is ask a question "Which fruits are red and round?" and divide all fruits which answer yes and no to the question. Now, All Red and Round fruits might not be apples and all apples won't be red and round. So I will ask a question "Which fruits have red or yellow color hints on them? " on red and round fruits and will ask "Which fruits are green and round?" on not red and round fruits. Based on these questions I can tell with considerable accuracy which are apples. This cascade of questions is what a decision tree is. However, this is a decision tree based on my intuition.