 caret


Learn tidymodels with my supervised machine learning course

#artificialintelligence

Today I am happy to announce that a new tidymodels-centric version of my free, online, interactive course, Supervised Machine Learning: Case Studies in R, has been published! This is at least the third version of this course I've built at this point but I believe it to be the best, in terms of how it communicates machine learning concepts and how useful to your real-world problems the demonstrated code will be. Similar to the last time I launched this course, it provides four case studies using data from the real world for you to practice your predictive modeling skills. One question we sometimes field from R users is about choosing to use tidymodels vs. caret. The original version of my course mostly used caret, and caret is a stable and broadly used framework for modeling and machine learning in R.


TSML (Time Series Machine Learning)

arXiv.org Machine Learning

Over the past years, the industrial sector has seen many innovations brought about by automation. Inherent in this automation is the installation of sensor networks for status monitoring and data collection. One of the major challenges in these data-rich environments is how to extract and exploit information from these large volumes of data to detect anomalies, discover patterns that reduce downtime and manufacturing errors, reduce energy usage, predict faults and failures, and plan effective maintenance schedules. To address these issues, we developed TSML. Its technology is based on using a pipeline of lightweight filters as building blocks to process huge amounts of industrial time series data in parallel.
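The idea of chaining lightweight filters and running the chain over many series at once can be sketched in a few lines. This is a minimal illustration in Python, not TSML's actual API: the filter names (`fill_missing`, `moving_average`, `flag_above`), the `pipeline` helper, the sensor names, and the readings are all hypothetical, and the thread pool merely stands in for whatever parallel backend a real system would use. It assumes every series has at least one non-missing reading.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def fill_missing(series):
    """Replace None readings with the last observed value.

    Leading gaps are filled with the first observed value.
    Assumes the series contains at least one non-missing reading.
    """
    last = next(value for value in series if value is not None)
    filled = []
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def moving_average(window):
    """Build a filter that averages each reading with its predecessors."""
    def apply(series):
        smoothed = []
        for i in range(len(series)):
            chunk = series[max(0, i - window + 1): i + 1]
            smoothed.append(sum(chunk) / len(chunk))
        return smoothed
    return apply


def flag_above(threshold):
    """Build a filter that marks readings above a fixed threshold."""
    def apply(series):
        return [value > threshold for value in series]
    return apply


def pipeline(*filters):
    """Compose filters left to right into a single callable."""
    return lambda series: reduce(lambda data, f: f(data), filters, series)


# Hypothetical sensor readings; None marks a missing sample.
sensors = {
    "pump_a": [1.0, None, 1.0, 9.0, 9.0],
    "pump_b": [2.0, 2.0, None, 2.0, 2.0],
}

detect = pipeline(fill_missing, moving_average(2), flag_above(4.0))

# Each series is independent, so the same pipeline can be mapped over all of them.
with ThreadPoolExecutor() as pool:
    results = dict(zip(sensors, pool.map(detect, sensors.values())))
```

Because each filter takes a series and returns a series, filters can be reordered or swapped without touching the others, which is the property that makes them usable as building blocks.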


Automatic Machine Learning Derived from Scholarly Big Data

arXiv.org Machine Learning

One of the challenging aspects of applying machine learning is the need to identify the algorithms that will perform best for a given dataset. This process can be difficult, time consuming and often requires a great deal of domain knowledge. We present Sommelier, an expert system for recommending the machine learning algorithms that should be applied on a previously unseen dataset. Sommelier is based on word embedding representations of the domain knowledge extracted from a large corpus of academic publications. When presented with a new dataset and its problem description, Sommelier leverages a recommendation model trained on the word embedding representation to provide a ranked list of the most relevant algorithms to be used on the dataset. We demonstrate Sommelier's effectiveness by conducting an extensive evaluation on 121 publicly available datasets and 53 classification algorithms. The top algorithms recommended for each dataset by Sommelier were able to achieve on average 97.7% of the optimal accuracy of all surveyed algorithms.


Walkthrough of the dummyVars function from the {caret} package: Machine Learning with R

#artificialintelligence

Walkthrough of the dummyVars function from the {caret} package: Machine Learning with R. More: sign up for my newsletter at http://www.viralml.com. My books on Amazon: "The Little Book of Fundamental Indicators: Hands-On Market Analysis with Python: Find Your Market Bearings with Python, Jupyter Notebooks, and Freely Available Data" (https://amzn.to/2DERG3d) and "Create Income Streams with Online Classes: Design Classes That Generate Long-Term Revenue" (https://amzn.to/2VToEHK).


Caret Package - A Practical Guide to Machine Learning in R

#artificialintelligence

Caret Package is a comprehensive framework for building machine learning models in R. In this tutorial, I explain nearly all the core features of the caret package and walk you through the step-by-step process of building predictive models. Be it a decision tree or xgboost, caret helps to find the optimal model in the shortest possible time. Caret nicely integrates all the activities associated with model development in a streamlined workflow, for nearly every major ML algorithm available in R. Actually, we will not stop with the caret package: we will also go a step ahead and see how to smartly ensemble predictions from multiple best models, and possibly produce an even better prediction, using caretEnsemble. Caret is short for Classification And REgression Training. With R having so many implementations of machine learning algorithms spread across packages, it may be challenging to keep track of which algorithm resides in which package. The syntax and the way to implement an algorithm sometimes differ across packages. Combined with preprocessing and looking up the help page for the hyperparameters (parameters that define how the algorithm learns), this can make building predictive models an involved task. Thanks to caret, no matter which package the algorithm resides in, caret will remember that for you and may just prompt you to run install.packages. Later in this tutorial I will show how to see all the available ML algorithms supported by caret (it's a long list!) and what hyperparameters can be tuned.


Machine Learning with R: An Irresponsibly Fast Tutorial

#artificialintelligence

As I said in Becoming a data hacker, R is an awesome programming language for data analysts, especially for people just getting started. In this post, I will give you a super quick, very practical, theory-free, hands-on intro to writing a simple classification model in R, using the caret package. If you want to skip the tutorial, you can find the R code here. Quick note: if the code examples look weird for you on mobile, give it a try on a desktop (you can't do the tutorial on your phone, anyway!). One of the biggest barriers to learning for budding data scientists is that there are so many different R packages for machine learning.


Encoding categorical variables: one-hot and beyond

#artificialintelligence

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere. Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix(). Note: do not use base::data.matrix(). The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model.
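The point that "how to encode new data is stored as part of the model" can be made concrete with a small sketch. The Python below is only an illustration of the principle, not R's contrast machinery: `fit_one_hot` and `apply_one_hot` are hypothetical helper names, and the color data is made up. The levels are learned once from the training data and then reused unchanged, so new data always lands in the same columns.

```python
def fit_one_hot(values):
    """Learn the category levels from training data, in sorted order."""
    return sorted(set(values))


def apply_one_hot(levels, values):
    """Encode values as indicator rows using the levels learned in training.

    A level that was not seen in training becomes an all-zero row
    instead of changing the column layout.
    """
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


train = ["red", "green", "blue", "green"]
new_data = ["green", "purple"]  # "purple" never appeared in training

levels = fit_one_hot(train)                    # stored alongside the model
train_encoded = apply_one_hot(levels, train)
new_encoded = apply_one_hot(levels, new_data)  # same columns as training
```

Re-deriving the levels from `new_data` alone would yield two columns with different meanings from the three used in training, which is the kind of silent mismatch the passage warns about.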


9 nifty Gboard for Android tricks you need to try

PCWorld

The only problem with Google's Gboard keyboard for Android is that I'm totally hooked on its best features. Read on for nine of the niftiest Gboard features, from dedicated number rows and an on-demand numeric keypad to "neural" translations and a long-press shortcut for oft-used symbols. Note: Yes, there's also a version of Gboard for iOS, but most of my favorite Gboard tricks only work on the Android version. Tapping a virtual keypad with a single thumb can be something of a stretch if your phone has a massive screen. Luckily, Gboard has a clever feature that makes it easier to tap with just one hand.


Propensity score prediction for electronic healthcare databases using Super Learner and High-dimensional Propensity Score Methods

arXiv.org Machine Learning

The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a "library" of candidate prediction models. The SL is not restricted to a single prediction model, but uses the strengths of a variety of learning algorithms to adapt to different databases. While the SL has been shown to perform well in a number of settings, it has not been thoroughly evaluated in large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we applied and evaluated the performance of the SL in its ability to predict treatment assignment using three electronic healthcare databases. We considered a library of algorithms that consisted of both nonparametric and parametric models. We also considered a novel strategy for prediction modeling that combines the SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining the SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
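The cross-validated selection step described above can be sketched as follows. This is a simplified illustration in Python under stated assumptions, not the study's implementation: it shows only the selection variant (choosing the single candidate with the lowest cross-validated loss), whereas Super Learner implementations typically also form a weighted combination of candidates. The two candidate learners, the synthetic treatment data, and the interleaved fold assignment are all hypothetical; the loss is the negative log-likelihood, one of the three metrics named in the abstract.

```python
import math


def fit_marginal(xs, ys):
    """Candidate 1: ignore the covariate, predict the overall treatment rate."""
    rate = sum(ys) / len(ys)
    return lambda x: rate


def fit_stratified(xs, ys):
    """Candidate 2: predict the treatment rate within each covariate level."""
    overall = sum(ys) / len(ys)
    totals, counts = {}, {}
    for x, y in zip(xs, ys):
        totals[x] = totals.get(x, 0) + y
        counts[x] = counts.get(x, 0) + 1
    rates = {x: totals[x] / counts[x] for x in counts}
    return lambda x: rates.get(x, overall)


def neg_log_likelihood(p, y, eps=1e-6):
    """Bernoulli negative log-likelihood, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def cv_loss(fit, xs, ys, k):
    """Mean held-out negative log-likelihood over k interleaved folds."""
    total = 0.0
    for fold in range(k):
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        total += sum(neg_log_likelihood(model(xs[i]), ys[i]) for i in test_idx)
    return total / len(xs)


def select_learner(library, xs, ys, k=5):
    """Return the name of the candidate with the lowest cross-validated loss."""
    losses = {name: cv_loss(fit, xs, ys, k) for name, fit in library.items()}
    return min(losses, key=losses.get), losses


# Synthetic data: treatment mostly follows a binary covariate, with some noise.
xs = [i % 2 for i in range(40)]
ys = [x if i % 7 else 1 - x for i, x in enumerate(xs)]

library = {"marginal": fit_marginal, "stratified": fit_stratified}
best, losses = select_learner(library, xs, ys)
```

On this synthetic data the covariate carries most of the signal, so the stratified candidate has the lower held-out loss; on data where the covariate is uninformative, the same procedure would favor the simpler candidate, which is the adaptivity the abstract describes.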