 Regression


optimism-corrected regression coefficients using Frank Harrell's method?

#artificialintelligence

I used a regularized (LASSO) Cox regression to estimate relapse times of patients and used Frank Harrell's bootstrapping method to obtain an optimism-corrected performance estimate of my model. Would such an optimism-corrected b be a better predictor for unseen cases?
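The bootstrap procedure the question refers to can be sketched roughly as follows. This is a minimal illustration under loud assumptions: it swaps in one-predictor ordinary least squares and R^2 for the LASSO Cox model and its performance index, uses synthetic data, and all function names (`fit_line`, `r_squared`, `optimism_corrected`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """R^2 of a fitted (intercept, slope) model evaluated on the data (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def optimism_corrected(xs, ys, n_boot=200, seed=0):
    """Return (apparent, mean optimism, apparent - mean optimism)."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent performance: fit and evaluate on the same data.
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    total = 0.0
    for _ in range(n_boot):
        # Resample cases with replacement and refit the whole model.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: performance on the bootstrap sample minus on the original data.
        total += r_squared(model, bx, by) - r_squared(model, xs, ys)
    optimism = total / n_boot
    return apparent, optimism, apparent - optimism


# Synthetic data, for illustration only.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

apparent, optimism, corrected = optimism_corrected(xs, ys)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

In this sketch the correction is applied to the performance estimate only; the coefficients of the model fitted on the full data are left unchanged.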


High Dimensional Multivariate Regression and Precision Matrix Estimation via Nonconvex Optimization

arXiv.org Machine Learning

We propose a nonconvex estimator for joint multivariate regression and precision matrix estimation in the high dimensional regime, under sparsity constraints. A gradient descent algorithm with hard thresholding is developed to solve the nonconvex estimator, and it attains a linear rate of convergence to the true regression coefficients and precision matrix simultaneously, up to the statistical error. Compared with existing methods along this line of research, which have little theoretical guarantee, the proposed algorithm not only is computationally much more efficient with provable convergence guarantee, but also attains the optimal finite sample statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real datasets back up our theory.


Forecasting wind power - Modeling periodic and non-linear effects under conditional heteroscedasticity

arXiv.org Machine Learning

In this article we present an approach that enables joint wind speed and wind power forecasts for a wind park. We combine a multivariate seasonal time-varying threshold autoregressive moving average (TVARMA) model with a power threshold generalized autoregressive conditional heteroscedastic (power-TGARCH) model. The modeling framework incorporates diurnal and annual periodicity modeling by periodic B-splines, conditional heteroscedasticity and a complex autoregressive structure with nonlinear impacts. In contrast to usually time-consuming estimation approaches such as likelihood estimation, we apply a high-dimensional shrinkage technique. We utilize an iteratively re-weighted least absolute shrinkage and selection operator (lasso) technique. It allows for conditional heteroscedasticity, provides fast computing times and guarantees a parsimonious and regularized specification, even though the parameter space may be vast. We are able to show that our approach provides accurate forecasts of wind power at a turbine-specific level for forecasting horizons of up to 48 hours (short- to medium-term forecasts).


Scalable machine learning with InsightEdge: mobile advertisement clicks prediction – InsightEdge

#artificialintelligence

This blog post provides an introduction to using machine learning algorithms with InsightEdge. We will go through an exercise to predict mobile advertisement click-through rate with Avazu's dataset. There are several compensation models in the online advertising industry; probably the most notable is CPC (Cost Per Click), in which an advertiser pays a publisher when the ad is clicked. Search engine advertising is one of the most popular forms of CPC. It allows advertisers to bid for ad placement in a search engine's sponsored links when someone searches on a keyword that is related to their business offering.


Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks

arXiv.org Machine Learning

In retrospective assessments, internet news reports have been shown to capture early reports of unknown infectious disease transmission prior to official laboratory confirmation. In general, media interest and reporting peak and wane during the course of an outbreak. In this study, we quantify the extent to which media interest during infectious disease outbreaks is indicative of trends of reported incidence. We introduce an approach that uses supervised temporal topic models to transform large corpora of news articles into temporal topic trends. The key advantages of this approach include applicability to a wide range of diseases and the ability to capture disease dynamics, including seasonality and abrupt peaks and troughs. We evaluated the method using data from multiple infectious disease outbreaks reported in the United States of America (U.S.), China and India. We noted that temporal topic trends extracted from disease-related news reports successfully captured the dynamics of multiple outbreaks such as whooping cough in the U.S. (2012), dengue outbreaks in India (2013) and China (2014). Our observations also suggest that efficient modeling of temporal topic trends using time-series regression techniques can estimate disease case counts with increased precision before official reports by health organizations.


What do Predictive Analytics Consultants Do? Part 2

#artificialintelligence

Last week I posted an article called What Do Predictive Analytics Consultants Do? Part 1, describing the general types of activities that we engage in. In the present article, I want to talk about the skills and tools that one should have to perform Predictive Analytics. Although this is not strictly a "What we do" article, knowing the skills we possess and the tools we use will provide some insight into what we do, without talking about some algorithm that you may have never heard of. I am always at a loss in describing the skills of analytics, for there are many. I just completed a new book about analytics (available for FREE--see notes) that has a different approach than Predictive Analytics using R (also available for FREE), though I am using material from three chapters.


Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn - Machine Learning Mastery

#artificialintelligence

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem. You cannot know beforehand which algorithms are best suited to your problem. You must trial a number of methods and focus attention on those that prove themselves the most promising. In this post you will discover 6 machine learning algorithms that you can use when spot checking your regression problem in Python with scikit-learn.


Making data science accessible – Logistic Regression

@machinelearnbot

Regression is a modelling technique for predicting the values of an outcome variable from one or more explanatory variables. Logistic Regression is a specific approach for describing a binary outcome variable (for example yes/no). Let's assume you own a new boutique shop. You have a list of potential clients you are thinking of inviting to a special event with the aim of maximizing the number of sales – who should you invite? Data on previous events you have run is a great starting point here, allowing you to predict an individual's likelihood of buying given the information you have on them.
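To make the boutique example concrete, here is a minimal sketch of a one-variable logistic regression fitted by plain gradient descent. The data are invented for illustration (a hypothetical count of past purchases per client and whether they bought at a previous event), and the names `sigmoid`, `fit_logistic` and `buy_probability` are assumptions, not part of the original article.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """Fit p(y=1 | x) = sigmoid(w0 + w1 * x) by gradient descent on the mean log-loss."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w0 + w1 * x) - y  # gradient of the log-loss w.r.t. the score
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1


def buy_probability(model, x):
    """Predicted likelihood of buying for a client with x past purchases."""
    w0, w1 = model
    return sigmoid(w0 + w1 * x)


# Hypothetical history: past purchases per client, and whether they bought (1) or not (0).
past_purchases = [0, 1, 2, 3, 4, 5, 6, 7]
bought = [0, 0, 1, 0, 1, 1, 0, 1]

model = fit_logistic(past_purchases, bought)
for x in (0, 3, 7):
    print(x, round(buy_probability(model, x), 3))
```

Ranking clients by the predicted probability is one simple way to decide who to invite under a fixed number of invitations.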


Expanding your machine learning toolkit: Randomized search, computational budgets, and new algorithms by Anonymous

#artificialintelligence

Previously, we wrote about some common trade-offs in machine learning and the importance of tuning models to your specific dataset. We demonstrated how to tune a random forest classifier using grid search, and how cross-validation can help avoid overfitting when tuning hyperparameters (HPs). You'll learn a different strategy for traversing hyperparameter space - randomized search - and how to use it to tune two other classification algorithms - a support vector machine and a regularized logistic regression classifier. We'll keep working with the wine dataset, which contains chemical characteristics of wines of varying quality. As before, our goal is to try to predict a wine's quality from these features.
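The idea of randomized search under a computational budget can be sketched in a few lines. This is an illustration only: `toy_validation_score` is a hypothetical stand-in for a cross-validated score on real data, and the hyperparameter names `c` and `gamma` and their ranges are assumptions chosen for the example.

```python
import math
import random


def toy_validation_score(c, gamma):
    """Hypothetical stand-in for a cross-validated score (higher is better).

    It peaks at c = 1 and gamma = 0.01; a real search would train and
    validate a model here instead.
    """
    return -((math.log10(c)) ** 2 + (math.log10(gamma) + 2.0) ** 2)


def randomized_search(score_fn, budget, seed=0):
    """Evaluate `budget` randomly sampled configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, -math.inf
    for _ in range(budget):
        # Sample each hyperparameter log-uniformly over its assumed range.
        params = {
            "c": 10 ** rng.uniform(-3, 3),
            "gamma": 10 ** rng.uniform(-5, 1),
        }
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = randomized_search(toy_validation_score, budget=50)
print(best_params, best_score)
```

Unlike grid search, the number of model fits is set directly by `budget` rather than by the size of the grid, so the cost stays fixed as more hyperparameters are added.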


Singular ridge regression with homoscedastic residuals: generalization error with estimated parameters

arXiv.org Machine Learning

This paper characterizes the conditional distribution properties of the finite sample ridge regression estimator and uses that result to evaluate total regression and generalization errors that incorporate the inaccuracies committed at the time of parameter estimation. The paper provides explicit formulas for those errors. Unlike other classical references in this setup, our results take place in a fully singular setup that does not assume the existence of a solution for the non-regularized regression problem. In exchange, we invoke a conditional homoscedasticity hypothesis on the regularized regression residuals that is crucial in our developments.
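As a small illustration of why regularization matters in a singular setup, the sketch below uses a design matrix with two identical columns, so the non-regularized normal equations have no unique solution, while the ridge system is solvable for any positive penalty. This is only one simple reading of "singular"; the function `ridge_2d` and the data are assumptions for the example and are not taken from the paper.

```python
def ridge_2d(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for a two-column design matrix X."""
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * t for r, t in zip(X, y))
    g1 = sum(r[1] * t for r, t in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system: no unique solution")
    # Cramer's rule for the 2x2 system.
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)


# Two identical columns make X^T X singular.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]

try:
    ridge_2d(X, y, lam=0.0)
except ValueError as exc:
    print("lam=0:", exc)

beta = ridge_2d(X, y, lam=1.0)
print("lam=1:", beta)
```

With the penalty in place the weight is split evenly between the two duplicated columns, and their sum is shrunk slightly below the value 2 that would reproduce `y` exactly.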