Decision Tree Learning
Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting
Lopes, Miles E., Wu, Suofei, Lee, Thomas C. M.
When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression --- which complements our companion paper in the context of classification (Lopes 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations.
The Use of Binary Choice Forests to Model and Estimate Discrete Choice Models
Chen, Ningyuan, Gallego, Guillermo, Tang, Zhuodong
We show the equivalence of discrete choice models and the class of binary choice forests, which are random forests based on binary choice trees. This suggests that standard machine learning techniques based on random forests can serve to estimate discrete choice models with an interpretable output. This is confirmed by our data-driven result, which states that random forests can accurately predict the choice probability of any discrete choice model. Our framework has unique advantages: it can capture behavioral patterns such as irrationality or sequential searches; it handles nonstandard formats of training data that result from aggregation; it can measure product importance based on how frequently a random customer would make decisions depending on the presence of the product; it can also incorporate price information. Our numerical results show that binary choice forests can outperform the best parametric models with much better computational times.
KiloGrams: Very Large N-Grams for Malware Classification
Raff, Edward, Fleming, William, Zak, Richard, Anderson, Hyrum, Finlayson, Bill, Nicholas, Charles, McLean, Mark
N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n\geq1024$. Despite the unprecedented size of $n$ considered, we show how these features still have predictive ability for malware classification tasks. More important, large $n$-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features that rival efficacy of professionally-developed features when used to train gradient-boosted decision tree models on the EMBER dataset.
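For orientation, the task of finding the top-$k$ most frequent $n$-grams can be sketched with plain exact counting. This is a naive baseline and not the faster method the abstract describes: it keeps every distinct $n$-gram in memory, which is exactly what becomes impractical for large $n$ and large corpora. The function name `top_k_ngrams` and the sample bytes are hypothetical, chosen only for illustration.

```python
from collections import Counter


def top_k_ngrams(data: bytes, n: int, k: int):
    """Return the k most frequent byte n-grams in data by exact counting.

    Naive baseline (an assumption for illustration): memory grows with the
    number of distinct n-grams, so it does not scale to large n or corpora.
    """
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    # Slide a window of width n over the byte string and count each slice.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


sample = b"abababcabab"  # hypothetical input standing in for a binary file
result = top_k_ngrams(sample, n=2, k=2)
print(result)  # [(b'ab', 5), (b'ba', 3)]
```

The returned counts are the kind of per-file features that could then be passed to a classifier.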
Local Interpretation Methods to Machine Learning Using the Domain of the Feature Space
Botari, Tiago, Izbicki, Rafael, de Carvalho, Andre C. P. L. F.
As machine learning becomes an important part of many real world applications affecting human lives, new requirements, besides high predictive accuracy, become important. One important requirement is transparency, which has been associated with model interpretability. Many machine learning algorithms induce models that are difficult to interpret, known as black boxes. Moreover, people have difficulty trusting models that cannot be explained. In machine learning in particular, many groups are investigating new methods able to explain black box models. These methods usually look inside the black box models to explain their inner workings. By doing so, they allow the interpretation of the decision making process used by black box models. Among the recently proposed model interpretation methods, there is a group, named local estimators, which are designed to explain how the label of a particular instance is predicted. To do so, they induce interpretable models on the neighborhood of the instance to be explained. Local estimators have been successfully used to explain specific predictions. Although they provide some degree of model interpretability, it is still not clear what the best way to implement and apply them is. Open questions include: how to best define the neighborhood of an instance? How to control the trade-off between the accuracy of the interpretation method and its interpretability? How to make the obtained solution robust to small variations in the instance to be explained? To answer these questions, we propose and investigate two strategies: (i) using data instance properties to provide improved explanations, and (ii) making sure that the neighborhood of an instance is properly defined by taking the geometry of the domain of the feature space into account. We evaluate these strategies in a regression task and present experimental results that show that they can improve local explanations.
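The general idea of a local estimator, fitting an interpretable model on the neighborhood of one instance, can be sketched as a kernel-weighted linear fit in one dimension. This is a generic local surrogate, not the two strategies proposed in the abstract. The `black_box` function, the Gaussian neighborhood, and the `width` parameter are all assumptions made for illustration.

```python
import math
import random


def black_box(x):
    """Hypothetical opaque model; stands in for any trained regressor."""
    return math.sin(3 * x) + x * x


def local_linear_explanation(f, x0, width=0.1, n_samples=500, seed=0):
    """Fit a weighted straight line to f around x0 and return (slope, intercept).

    The neighborhood is an assumed Gaussian of standard deviation `width`;
    samples are weighted by a Gaussian kernel centered at x0.
    """
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n_samples)]
    ys = [f(x) for x in xs]
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]

    # Closed-form weighted least squares for a single feature.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    cov = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    slope = cov / var
    return slope, my - slope * mx


slope, intercept = local_linear_explanation(black_box, x0=0.5)
print(slope, intercept)
```

The fitted slope serves as the local explanation: it approximates how the prediction changes near `x0`. Changing `width` changes the neighborhood, which is one way to see why the definition of the neighborhood matters.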
Climate-driven statistical models as effective predictors of local dengue incidence in Costa Rica: A Generalized Additive Model and Random Forest approach
Vásquez, Paola, Loría, Antonio, Sanchez, Fabio, Barboza, Luis A.
Climate has been an important factor in shaping the distribution and incidence of dengue cases in tropical and subtropical countries. In Costa Rica, a tropical country with distinctive micro-climates, dengue has been endemic since its introduction in 1993, inflicting substantial economic, social, and public health repercussions. Using the number of reported dengue cases and climate data from 2007-2017, we fitted a prediction model applying a Generalized Additive Model (GAM) and Random Forest (RF) approach, which allowed us to retrospectively predict dengue occurrence in five climatologically diverse municipalities around the country.
ML DL AI DS BD - An Introduction
In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own.
Give Pricing Decisions The AI Edge
In my experience as a business transformation solutions expert, I know that deals are only closed when both buyer and seller see the value. For the seller, this means optimizing revenues and margins. To do this consistently, enterprises must not only know what solutions to offer their customers but also be able to gauge their customers' willingness to pay. In a competitive environment where many players offer similar services and solutions, the ability to consistently offer a price that is well within the customer's zone of price comfort is vital to success. Enterprises in the business to business (B2B) space generally have well-defined policies that govern not only pricing and margin requirements but also discounts, preferential payment terms, and so on.
Optimizing Hyperparameters for Random Forest Algorithms in scikit-learn
Optimizing hyperparameters for machine learning models is a key step in making accurate predictions. Hyperparameters define characteristics of the model that can impact model accuracy and computational efficiency. They are typically set prior to fitting the model to the data. In contrast, parameters are values estimated during the training process that allow the model to fit the data. Hyperparameters are often optimized through trial and error; multiple models are fit with a variety of hyperparameter values, and their performance is compared. For random forest algorithms, one can manipulate a variety of key attributes that define model structure.
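The trial-and-error procedure described above can be sketched with scikit-learn's `GridSearchCV`, which fits one model per hyperparameter combination and compares them by cross-validated score. The synthetic dataset and the particular grid values are assumptions chosen only to keep the example small. They are not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hypothetical grid over a few hyperparameters that shape forest structure.
param_grid = {
    "n_estimators": [25, 50],        # number of trees
    "max_depth": [3, None],          # None lets trees grow until leaves are pure
    "max_features": ["sqrt", 0.5],   # features considered at each split
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

An exhaustive grid grows multiplicatively with each added hyperparameter, so for larger searches `RandomizedSearchCV` from the same module is a common alternative.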
Estimating the Algorithmic Variance of Randomized Ensembles via the Bootstrap
Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough" --- so that its accuracy is close to that of an ideal infinite ensemble. Due to the fact that bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e. the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable $Err_t$ denote the prediction error of a randomized ensemble of size $t$. Working under a "first-order model" for randomized ensembles, we prove that the centered law of $Err_t$ can be consistently approximated via the proposed method as $t\to\infty$. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of $Err_t$ are negligible.
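To make "algorithmic variance" concrete, the sketch below holds the training data fixed, retrains a small bagged ensemble many times with different random seeds, and measures the variance of the resulting test error $Err_t$ for several ensemble sizes $t$. This is a brute-force Monte Carlo illustration and not the bootstrap method proposed in the abstract, which is designed to avoid this kind of repeated retraining. The threshold-stump learner, the synthetic data, and all function names are assumptions for illustration.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best_thr, best_err = xs[0], len(xs) + 1
    for thr in xs:
        err = sum((x > thr) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr


def train_ensemble(xs, ys, t, rng):
    """Bagging: fit t stumps, each on a bootstrap resample of the fixed data."""
    n = len(xs)
    thresholds = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def ensemble_error(thresholds, xs, ys):
    """Misclassification rate of the majority vote over the stumps."""
    wrong = 0
    for x, y in zip(xs, ys):
        votes = sum(x > thr for thr in thresholds)
        pred = 1 if 2 * votes > len(thresholds) else 0
        wrong += pred != y
    return wrong / len(xs)


def make_data(n, rng):
    """Hypothetical noisy 1-D classification data."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1 if x + rng.gauss(0, 0.5) > 0 else 0 for x in xs]
    return xs, ys


data_rng = random.Random(0)
train_x, train_y = make_data(60, data_rng)   # fixed training set
test_x, test_y = make_data(200, data_rng)    # fixed evaluation set


def algorithmic_variance(t, repeats=30):
    """Variance of Err_t over the training algorithm's randomness only."""
    errs = []
    for seed in range(repeats):
        ens = train_ensemble(train_x, train_y, t, random.Random(seed))
        errs.append(ensemble_error(ens, test_x, test_y))
    return statistics.pvariance(errs)


variances = {t: algorithmic_variance(t) for t in (1, 5, 25)}
print(variances)
```

When the variance for a given $t$ is small compared with the error itself, the fluctuations of $Err_t$ caused by the training algorithm are negligible at that ensemble size.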