tidymodel
fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R
Korkmaz, Selcuk, Goksuluk, Dincer, Karaismailoglu, Eda
Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
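The contrast between global and fold-local normalization described in this abstract can be sketched in a few lines of base R. This is only an illustration of the general idea, not fastml's implementation or API: the helper `guarded_scale()`, the simulated vector `x`, and the fold assignment `fold_id` are all hypothetical names chosen for this example.

```r
# Hypothetical helper (not part of fastml): standardize each assessment fold
# using a mean and standard deviation estimated only on the matching
# analysis rows, i.e. the rows outside that fold.
guarded_scale <- function(x, fold_id) {
  out <- numeric(length(x))
  for (k in unique(fold_id)) {
    assess <- fold_id == k
    mu <- mean(x[!assess])      # estimated without the assessment rows
    sigma <- sd(x[!assess])
    out[assess] <- (x[assess] - mu) / sigma
  }
  out
}

set.seed(1)
x <- rnorm(20, mean = 50, sd = 10)   # simulated predictor
fold_id <- rep(1:5, times = 4)       # five folds of four rows each

# Leaky version: the statistics are computed on every row, including the
# rows that are later used for assessment.
global_scaled <- (x - mean(x)) / sd(x)

# Guarded version: the statistics are re-estimated inside each resample.
guarded_scaled <- guarded_scale(x, fold_id)

# Largest disagreement between the two versions
max(abs(global_scaled - guarded_scaled))
```

The two vectors differ because the global statistics have already seen the assessment rows, which is the mechanism behind the leakage the abstract describes.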
EZtune: A Package for Automated Hyperparameter Tuning in R
Statistical learning models have been growing in popularity in recent years. Many of these models have hyperparameters that must be tuned for models to perform well. Tuning these parameters is not trivial. EZtune is an R package with a simple user interface that can tune support vector machines, adaboost, gradient boosting machines, and elastic net. We first provide a brief summary of the models that EZtune can tune, including a discussion of each of their hyperparameters. We then compare the ease of using EZtune, caret, and tidymodels. This is followed by a comparison of the accuracy and computation times for models tuned with EZtune and tidymodels. We conclude with a demonstration of how EZtune can be used to help select a final model with optimal predictive power. Our comparison shows that EZtune can tune support vector machines and gradient boosting machines. EZtune also provides a user interface that is easy to use for a novice to statistical learning models or R.
Optimizing Machine Learning Workflows with Tidymodels
Tidymodels is a package that is designed to streamline machine learning workflows in R. It consists of a suite of packages that can be used to pre-process data, build and tune machine learning models, and evaluate their performance. Tidymodels is particularly useful for those who are new to machine learning, as it provides an easy-to-use interface for building and evaluating models, and it can be used to quickly iterate through different model architectures and parameters. To get started with tidymodels, you will need to install the package and its dependencies. Once you have installed tidymodels, you can start using it in your machine learning workflows. Before building a machine learning model, it is often necessary to pre-process the data to ensure that it is in a suitable format.
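As a rough sketch of the install and pre-processing steps mentioned above: the example below assumes the tidymodels meta-package is available from CRAN and uses the built-in `mtcars` data as a stand-in for real data. The choice of outcome (`mpg`) and of a normalization step is an assumption made for illustration, not something taken from the article.

```r
# One-time install of the meta-package and its dependencies:
# install.packages("tidymodels")
library(tidymodels)

# Define a recipe: mpg is the outcome, all other columns are predictors,
# and every numeric predictor is centered and scaled.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Estimate the means and standard deviations from the training data
prepped <- prep(rec, training = mtcars)

# Return the processed training set
baked <- bake(prepped, new_data = NULL)
head(baked)
```

Passing `new_data = NULL` to `bake()` returns the processed training data; passing a new data frame instead applies the already-estimated statistics to it.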
Introducing random forests in R
In this post, I will present how to use random forests in classification, a prediction technique that consists of generating a set of trees (hence, a forest) by bootstrapping the features used in each tree. We do this to obtain trees that are not necessarily using the strongest predictors at the beginning. I will test this technique on a LoanDefaults dataset to predict which customers will default on a loan payment in a specific month. This dataset has two interesting features: the number of positive cases is much smaller than the number of negatives, and it requires some preprocessing of the existing features. I will be using the ranger (RANdom forest GEneRator) package, skimr to get a summary of data, rpart and rpart.plot to generate an alternative decision tree model, BAdatasets to access the dataset, tidymodels for prediction workflow facilities and forcats for the variable importance plot.
Tidy Time Series Forecasting in R with Spark
I'm SUPER EXCITED to show fellow time-series enthusiasts a new way that we can scale time series analysis using an amazing technology called Spark! Without Spark, large-scale forecasting projects of 10,000 time series can take days to run because of long-running for-loops and the need to test many models on each time series. Spark has been widely accepted as a "big data" solution, and we'll use it to scale-out (distribute) our time series analysis to Spark Clusters, and run our analysis in parallel. Spark is an amazing technology for processing large-scale data science workloads. Modeltime is a state-of-the-art forecasting library that I personally developed for "Tidy Forecasting" in R. Modeltime now integrates a Spark Backend with capability of forecasting 10,000 time series using distributed Spark Clusters.
Introducing Modeltime: Tidy Time Series Forecasting using Tidymodels
I'm beyond excited to introduce modeltime, a new time series forecasting package designed to speed up model evaluation, selection, and forecasting. Follow the updated modeltime article to get started with modeltime. If you like what you see, I have an Advanced Time Series Course where you will become the time-series expert for your organization by learning modeltime and timetk. This article is part of a series of software announcements on the Modeltime Forecasting Ecosystem. Register to stay in the know on new cutting-edge R software like modeltime.
How to Use Catboost with Tidymodels
So you want to compete in a Kaggle competition with R and you want to use tidymodels. In this howto I show how you can use CatBoost with tidymodels. I give very terse descriptions of what the steps do, because I believe you read this post for implementation, not background on how the elements work. This tutorial is extremely similar to my previous post about using lightGBM with Tidymodels. Tidymodels is a unified machine learning framework that uses sane defaults, keeps model definitions and implementation separate and allows you to easily swap models or change parts of the processing.
Tidymodels: tidy machine learning in R
Over the past few years, tidymodels has been gradually emerging as the tidyverse's machine learning toolkit. Well, it turns out that R has a consistency problem. Since everything was made by different people and using different principles, everything has a slightly different interface, and trying to keep everything in line can be frustrating. Several years ago, Max Kuhn (formerly at Pfizer, now at RStudio) developed the caret R package (see my caret tutorial) aimed at creating a uniform interface for the massive variety of machine learning models that exist in R. Caret was great in a lot of ways, but also limited in others. In my own use, I found it to be quite slow whenever I tried to use it on problems of any kind of modest size.
Customer Churn Modeling using Machine Learning with parsnip
This article comes from Diego Usai, a student in Business Science University. Diego has completed both 101 (Data Science Foundations) and 201 (Advanced Machine Learning & Business Consulting) courses. Diego shows off his progress in this Customer Churn Tutorial using Machine Learning with parsnip. Diego originally posted the article on his personal website, diegousai.io. Recently I have completed the online course Business Analysis With R focused on applied data and business science with R, which introduced me to a couple of new modelling concepts and approaches.