gridsearch
Does gridsearch on random forest make sense?
You are right that randomness will play a role in the results, as it does with many other algorithms (MCMC samplers for Bayesian models, XGBoost, LightGBM, neural networks, etc.). The obvious way to reduce the randomness in the results of any hyperparameter optimization method for RF (whether it's random grid search, grid search, or some Bayesian hyperparameter optimization method) is to increase the number of trees, which makes the model's behavior less random at the cost of increased training time. Alternatively, you can construct a surrogate model on top of the results that, through an appropriate amount of smoothing/regularization, takes into account that the signal about where the best model lies in the hyperparameter landscape is noisy.
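The effect of the number of trees can be checked empirically. The following is a minimal sketch, assuming scikit-learn and a synthetic dataset from `make_classification` (neither comes from the original answer; the helper `seed_spread` is a hypothetical name). It refits the same forest with several seeds and measures how much the predicted probabilities differ from seed to seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, an assumption made purely for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def seed_spread(n_trees, seeds=range(8)):
    """Mean (over test points) std. dev. across seeds of the predicted P(y=1)."""
    probs = [
        RandomForestClassifier(n_estimators=n_trees, random_state=s)
        .fit(X_train, y_train)
        .predict_proba(X_test)[:, 1]
        for s in seeds
    ]
    # Rows are seeds, columns are test points: std over seeds, then average.
    return float(np.std(probs, axis=0).mean())

# Compare a small forest with a large one; only the seed changes between fits.
spread = {n: seed_spread(n) for n in (5, 200)}
for n, s in spread.items():
    print(f"n_estimators={n:>3}: mean seed-to-seed std of P(y=1) = {s:.4f}")
```

The larger forest should show a noticeably smaller spread, which is why a hyperparameter search run with more trees tends to give more repeatable rankings.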
GridSearch: the ultimate Machine Learning Tool
The goal of supervised Machine Learning is to build a prediction function based on historical data. This data has independent (explanatory) variables and a target variable (the variable that you want to predict). Once a predictive model has been built, we measure its error on a separate test data set. We do this using KPIs that quantify the error of the model, for example the Mean Squared Error in a regression context (quantitative target variable) or the Accuracy in a classification context (categorical target variable). The model with the smallest error is generally selected as the best model.
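As a small illustration of the two KPIs mentioned above, here is a sketch in plain Python. The numbers and the names `model_a` and `model_b` are made up for the example and do not come from the article.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; lower is better (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels; higher is better (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Regression: test-set targets and predictions from two hypothetical models.
y_test = [3.0, 5.0, 2.0, 7.0]
predictions = {
    "model_a": [2.0, 5.0, 4.0, 8.0],
    "model_b": [3.0, 4.0, 2.0, 6.0],
}
errors = {name: mean_squared_error(y_test, p) for name, p in predictions.items()}
best = min(errors, key=errors.get)  # model with the smallest test error
print(errors, "->", best)

# Classification: compare predicted labels with the true labels.
labels_true = ["cat", "dog", "dog", "cat"]
labels_pred = ["cat", "dog", "cat", "cat"]
acc = accuracy(labels_true, labels_pred)
print("accuracy:", acc)
```

Grid search applies the same selection rule, with each candidate being one combination of hyperparameter values.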
Battle of the Boosters
We have come a long way in the world of Gradient Boosting. If you have followed the whole series, you should have a much better understanding of the theory and practical aspects of the major algorithms in this space. After a grim walk through the math and theory behind these algorithms, I thought it would be a fun change to see all of them in action in a highly practical blog post. I have chosen a few regression datasets from Kaggle Datasets, mainly because they are easy to set up and run in Google Colab. Another reason is that I do not need to spend a lot of time on data preprocessing; instead, I can pick one of the public kernels and get cracking.
Understanding Decision Trees In Machine Learning and How To Implement It In Python Using sklearn
Decision Trees are a type of supervised learning used for classification (e.g. yes/no) and regression (continuous data), where the data is repeatedly split according to a certain parameter. The predicted class is derived from features of the data. The following article creates a Decision Tree from the 311 on 3.11 Project. In this project, the quantity being predicted is whether the resolution outcome is positive or negative.
Agency: NYPD, Dept of Transportation, Dept of Health & Mental Hygiene, Dept of Sanitation, Dept of Housing Preservation and Development, Dept of Parks and Recreation, etc.
Borough: Brooklyn, Queens, Manhattan, Bronx, Staten Island
Location: Longitude/Latitude, Cross Streets, Intersections
Created/Closed Date
Complaint Type: Heat/Hot Water, Rodent, Noise, Street Condition, Illegal Parking, Unsanitary Condition, Blocked Driveway are just a few examples.
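To give a feel for what such a model looks like in sklearn, here is a minimal sketch. It is not the article's code: the six rows and their labels are invented for illustration, only three of the categorical fields listed above are used, and one-hot encoding is an assumed preprocessing choice.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up rows for illustration only: [agency, borough, complaint_type].
X = [
    ["NYPD", "Brooklyn", "Noise"],
    ["NYPD", "Queens", "Illegal Parking"],
    ["Dept of Housing Preservation and Development", "Bronx", "Heat/Hot Water"],
    ["Dept of Housing Preservation and Development", "Brooklyn", "Heat/Hot Water"],
    ["Dept of Sanitation", "Manhattan", "Unsanitary Condition"],
    ["NYPD", "Manhattan", "Blocked Driveway"],
]
# Invented labels: 1 = positive resolution outcome, 0 = negative.
y = [1, 1, 0, 0, 1, 1]

# One-hot encode the categorical columns, then fit a shallow tree.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # unseen categories become all zeros
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
model.fit(X, y)

# Predictions on the training rows, plus one new (also made-up) complaint.
train_pred = list(model.predict(X))
new_pred = model.predict([["NYPD", "Bronx", "Noise"]])
print(train_pred, new_pred)
```

With real 311 data, the fitted tree would be evaluated on held-out complaints rather than on the rows it was trained on.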