fastml: Guarded Resampling Workflows for Safer Automated Machine Learning in R

arXiv.org Machine Learning

Preprocessing leakage arises when scaling, imputation, or other data-dependent transformations are estimated before resampling, inflating apparent performance while remaining hard to detect. We present fastml, an R package that provides a single-call interface for leakage-aware machine learning through guarded resampling, where preprocessing is re-estimated inside each resample and applied to the corresponding assessment data. The package supports grouped and time-ordered resampling, blocks high-risk configurations, audits recipes for external dependencies, and includes sandboxed execution and integrated model explanation. We evaluate fastml with a Monte Carlo simulation contrasting global and fold-local normalization, a usability comparison with tidymodels under matched specifications, and survival benchmarks across datasets of different sizes. The simulation demonstrates that global preprocessing substantially inflates apparent performance relative to guarded resampling. fastml matched held-out performance obtained with tidymodels while reducing workflow orchestration, and it supported consistent benchmarking of multiple survival model classes through a unified interface.
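The contrast between global and fold-local normalization can be illustrated with a minimal Python sketch. This is not the fastml API (fastml is an R package); every function name below is hypothetical, and the sketch uses standard-library code only. The guarded variant estimates the scaler from each resample's analysis rows and applies it to the matching assessment rows, while the leaky variant fits the scaler once on all rows before splitting.

```python
import statistics


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_scaler(values):
    """Estimate centering and scaling parameters from the given values only."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against zero variance
    return mean, sd


def apply_scaler(values, params):
    """Standardize values with previously estimated parameters."""
    mean, sd = params
    return [(v - mean) / sd for v in values]


def guarded_folds(x, k):
    """Yield (analysis, assessment) pairs scaled with fold-local parameters."""
    for held_out in kfold_indices(len(x), k):
        held = set(held_out)
        analysis = [x[i] for i in range(len(x)) if i not in held]
        assessment = [x[i] for i in held_out]
        params = fit_scaler(analysis)  # estimated inside the resample
        yield apply_scaler(analysis, params), apply_scaler(assessment, params)


def leaky_folds(x, k):
    """Same splits, but the scaler is fit once on all rows (the leaky variant)."""
    scaled = apply_scaler(x, fit_scaler(x))
    for held_out in kfold_indices(len(x), k):
        held = set(held_out)
        analysis = [scaled[i] for i in range(len(x)) if i not in held]
        assessment = [scaled[i] for i in held_out]
        yield analysis, assessment
```

In the guarded variant the assessment rows never influence the scaling parameters, so unusual held-out values stay visibly unusual after scaling; in the leaky variant they have already shifted the mean and standard deviation used for the analysis rows.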


My 5 Favorite Data Science Portfolios · Learning With Data

#artificialintelligence

At the end of the article, I posted a link to an example portfolio that I liked by Tim Dettmers. Afterward, I had a few people ask me to compile a larger list of great data science portfolios and projects. Although this one is a project rather than a portfolio, I think it is a great format to emulate. Melissa Runfeldt did a great job defining and motivating her problem, discussing how she gathered data, and explaining her methods with images of results, all in a way that would be easy for a non-technical person to follow (at least at a high level).


Deep learning architecture diagrams - FastML

@machinelearnbot

As a wild stream after a wet season in the African savanna diverges into many smaller streams forming lakes and puddles, so deep learning has diverged into a myriad of specialized architectures. Each architecture has a diagram. Here are some of them. Neural networks are conceptually simple, and that's their beauty. A bunch of homogeneous, uniform units, arranged in layers, weighted connections between them, and that's all.


Tuning hyperparams fast with Hyperband - FastML

#artificialintelligence

Hyperband is a relatively new method for tuning iterative algorithms. It performs random sampling and attempts to gain an edge by allocating the time spent optimizing as effectively as possible. We explain a few things that were not clear to us right away, and try the algorithm in practice. Candidates for tuning with Hyperband include all the SGD derivatives - meaning all of deep learning - and tree ensembles: gradient boosting, and perhaps to a lesser extent, random forest and extremely randomized trees. To quantify this idea, we compare to random search run at twice the speed, which beats the two Bayesian optimization methods; that is, running random search for twice as long yields superior results.
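A Hyperband-style search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the article: `sample_config` and `evaluate` are hypothetical user-supplied callbacks, `eta` is assumed to be an integer, and the bracket-size formula follows the commonly cited pseudocode, so the exact counts may differ from other implementations. Each bracket samples random configurations, evaluates them with a small resource budget, keeps roughly the best 1/eta, and repeats with eta times more resource.

```python
import math


def hyperband(sample_config, evaluate, max_resource=81, eta=3):
    """Return (best_loss, best_config) found by a Hyperband-style search.

    sample_config() -> a random configuration (hypothetical callback)
    evaluate(config, resource) -> loss after training with `resource` units
    eta is assumed to be an integer >= 2.
    """
    # Largest s with eta**s <= max_resource, computed without floating-point logs.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    budget = (s_max + 1) * max_resource  # nominal resource budget per bracket

    best = (math.inf, None)
    for s in range(s_max, -1, -1):
        # Large s: many configurations with little resource each.
        # s = 0: a few configurations run at full resource (plain random search).
        n = math.ceil(budget / max_resource * eta ** s / (s + 1))
        r = max_resource / eta ** s
        configs = [sample_config() for _ in range(n)]

        for i in range(s + 1):  # successive halving rounds
            resource = r * eta ** i
            scored = [(evaluate(c, resource), idx, c)
                      for idx, c in enumerate(configs)]
            scored.sort(key=lambda t: t[:2])  # by loss, ties by position
            if scored[0][0] < best[0]:
                best = (scored[0][0], scored[0][2])
            keep = max(1, len(configs) // eta)
            configs = [c for _, _, c in scored[:keep]]
    return best
```

For an iterative learner, `resource` would typically be a number of epochs or trees, and `evaluate` would return a validation loss. The sketch restarts training in every round; a practical version might resume from checkpoints instead.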


Adversarial validation, part two - FastML

#artificialintelligence

In this second article on adversarial validation we get to the meat of the matter: what we can do when train and test sets differ. Will we be able to make a better validation set? The problem with training examples being different from test examples is that validation won't be any good for comparing models. That's because validation examples originate in the training set. We can see this effect when using Numerai data, which comes from financial time series.
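One way to build a validation set that resembles the test set can be sketched as follows. The sketch assumes each training example already has a "test-likeness" score, for instance the predicted probability from a classifier trained to tell training rows from test rows; that classifier is not shown, and the function name and parameters are hypothetical. The highest-scoring training examples are set aside for validation.

```python
def split_by_test_likeness(train_rows, test_likeness, val_fraction=0.2):
    """Split training rows into (train_part, val_part).

    test_likeness[i] is an assumed score for train_rows[i]: higher means the
    row looks more like a test example. The top val_fraction of rows by score
    becomes the validation set; original row order is preserved in both parts.
    """
    if len(train_rows) != len(test_likeness):
        raise ValueError("one score per training row is required")
    n_val = max(1, int(len(train_rows) * val_fraction))
    # Indices sorted from most to least test-like.
    order = sorted(range(len(train_rows)),
                   key=lambda i: test_likeness[i], reverse=True)
    val_idx = set(order[:n_val])
    train_part = [row for i, row in enumerate(train_rows) if i not in val_idx]
    val_part = [row for i, row in enumerate(train_rows) if i in val_idx]
    return train_part, val_part
```

Models can then be compared on `val_part`, on the assumption that scores on test-like examples track test performance better than scores on a random slice of the training set.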


Loading data in Torch (is a mess) - FastML

@machinelearnbot

Torch 7 is a GPU-accelerated deep learning framework. It had been rather obscure until the recent publicity that followed its adoption by Facebook and DeepMind. This entirely anecdotal article describes our experiences trying to load some data in Torch. We had great expectations about Torch. It seemed like a dream come true, especially with endorsement by DeepMind and LeCun's group at Facebook.


Coming out - FastML

#artificialintelligence

People often ask how we've been able to learn about and cover so many diverse topics in machine learning (using at least three different programming languages - Python, Matlab, and R) and generally achieve such prominence in the community, all in a relatively short time. Today we finally give a definitive answer. There's no Zygmunt the Polish economist ever willing to relocate to San Francisco. And the "we" that we always use in the posts is not the majestic plural. We are three Chinese PhD students: Ah, Hai and Wang.