 Ensemble Learning


Active Learning for Efficient Testing of Student Programs

arXiv.org Artificial Intelligence

In this work, we propose an automated method to identify semantic bugs in student programs, called ATAS, which builds upon the recent advances in both symbolic execution and active learning. Symbolic execution is a program analysis technique which can generate test cases through symbolic constraint solving. Our method makes use of a reference implementation of the task as its sole input. We compare our method with a symbolic execution-based baseline on 6 programming tasks retrieved from CodeForces comprising a total of 23K student submissions. We show an average improvement of over 2.5x over the baseline in terms of runtime (thus making it more suitable for online evaluation), without a significant degradation in evaluation accuracy.


Asynchronous Parallel Sampling Gradient Boosting Decision Tree

arXiv.org Machine Learning

With the development of big data technology, the Gradient Boosting Decision Tree (GBDT) has become one of the most important machine learning algorithms because of its accurate output. However, training a GBDT requires substantial computational resources and time. To accelerate GBDT training, this paper proposes the asynchronous parallel sampling gradient boosting decision tree (asynch-SGBDT). By introducing sampling, we recast the numerical optimization process of traditional GBDT training as a stochastic optimization process and use asynchronous parallel stochastic gradient descent to accelerate it. We also provide a theoretical analysis of asynch-SGBDT. Experimental results show that asynch-SGBDT accelerates GBDT training. Our asynchronous parallel strategy achieves an almost linear speedup, especially for high-dimensional sparse datasets.
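To make the sampling idea in this abstract concrete, here is a minimal sketch of gradient boosting with row subsampling for squared loss, using one-split regression stumps on a single feature. It is sequential and does not model the paper's asynchronous parallel workers; the function names and the `sample_rate` parameter are hypothetical choices for illustration, not taken from the paper.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all x equal): fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def fit_sgb(xs, ys, n_rounds=50, learning_rate=0.1, sample_rate=0.5, seed=0):
    """Boost stumps, fitting each one on a random subsample of the rows."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    k = min(len(xs), max(2, int(sample_rate * len(xs))))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)  # the stochastic step
        sub_x = [xs[i] for i in idx]
        # For squared loss, the negative gradient is the residual.
        sub_r = [ys[i] - preds[i] for i in idx]
        threshold, left, right = fit_stump(sub_x, sub_r)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for p, x in zip(preds, xs)]
    return (base, learning_rate, stumps)


def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)
```

Because each round sees only a sample of the rows, every boosting step becomes a noisy gradient step, which is the property that makes stochastic-gradient-style parallelization applicable in the first place.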


Hyperparameters and Tuning Strategies for Random Forest

arXiv.org Machine Learning

The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures, also considering interactions between hyperparameters. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after a brief overview of tuning strategies we demonstrate the application of one of the most established tuning strategies, model-based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.
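As a rough illustration of two of the hyperparameters listed above, the sketch below shows how the observations for one tree and the candidate variables for one split might be drawn. The function and parameter names (`sample_fraction`, `replace`, `mtry`) are assumptions made for this example and are not the tuneRanger API.

```python
import random


def draw_observations(n_obs, sample_fraction=1.0, replace=True, rng=None):
    """Draw the row indices used to grow a single tree."""
    rng = rng or random.Random(0)
    k = max(1, int(round(sample_fraction * n_obs)))
    if replace:
        # Bootstrap: the same observation may be drawn more than once.
        return [rng.randrange(n_obs) for _ in range(k)]
    # Subsampling: every drawn observation is distinct.
    return rng.sample(range(n_obs), min(k, n_obs))


def draw_split_candidates(n_vars, mtry, rng=None):
    """Draw the variable indices considered at a single split."""
    rng = rng or random.Random(0)
    return sorted(rng.sample(range(n_vars), min(mtry, n_vars)))
```

Smaller values of `sample_fraction` and `mtry` make the individual trees more different from one another, which is one reason these settings interact with the number of trees.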


Start With Gradient Boosting, Results from Comparing 13 Algorithms on 165 Datasets - Machine Learning Mastery

#artificialintelligence

Which machine learning algorithm should you use? It is a central question in applied machine learning. In a recent paper, Randal Olson and others attempt to answer it and give you a guide for algorithms and parameters to try on your problem first, before spot checking a broader suite of algorithms. In this post, you will discover a study and findings from evaluating many machine learning algorithms across a large number of machine learning datasets and the recommendations made from this study.


Synced Tree Boosting With XGBoost – Why Does XGBoost Win "Every" Machine Learning Competition?

#artificialintelligence

Tree boosting has empirically proven to be efficient for predictive mining for both classification and regression. For many years, MART (multiple additive regression trees) has been the tree boosting method of choice. But starting in 2015, a new first-to-try, seemingly always-winning algorithm surged to the surface: XGBoost. This algorithm re-implements tree boosting and gained popularity by winning Kaggle and other data science competitions. The thesis Tree Boosting With XGBoost – Why Does XGBoost Win "Every" Machine Learning Competition? is by Didrik Nielsen of the Norwegian University of Science and Technology. It first introduces the supervised learning task and discusses model selection techniques.


CatBoost vs. Light GBM vs. XGBoost – Towards Data Science

#artificialintelligence

I recently participated in this Kaggle competition (WIDS Datathon by Stanford), where I was able to land in the top 10 using various boosting algorithms. Since then, I have been very curious about the fine workings of each model, including parameter tuning and pros and cons, and hence decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful in the regime of limited training data, little training time and little expertise for parameter tuning. Since XGBoost (often called GBM Killer) has been in the machine learning world for a longer time now, with lots of articles dedicated to it, this post will focus more on CatBoost & LGBM. LightGBM uses a novel technique, Gradient-based One-Side Sampling (GOSS), to filter the data instances used for finding a split value, while XGBoost uses a pre-sorted algorithm and a histogram-based algorithm for computing the best split.
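The GOSS step mentioned above can be sketched roughly as follows: keep the instances with the largest absolute gradients, randomly sample from the remainder, and up-weight the sampled ones so that the small-gradient part of the data is not under-represented. This is a simplified illustration and not LightGBM's implementation; the function name and the `top_rate` and `other_rate` parameters are assumptions for this example.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step."""
    if top_rate <= 0 or other_rate <= 0 or top_rate + other_rate > 1:
        raise ValueError("rates must be positive and sum to at most 1")
    rng = random.Random(seed)
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]    # always kept: large-gradient instances
    rest = order[n_top:]   # small-gradient instances, sampled at random
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Amplify the sampled small-gradient instances to compensate
    # for the ones that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return top + sampled, weights
```

With `top_rate=0.2` and `other_rate=0.1`, only 30% of the rows are scanned when searching for a split, and each sampled small-gradient row counts eight times.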


How Random Forest Algorithm Works in Machine Learning

#artificialintelligence

This is one of the best introductions to the Random Forest algorithm. The author introduces the algorithm with a real-life story and then provides applications in four different fields to help beginners learn and know more about this algorithm. To begin the article, the author highlights one advantage of the Random Forest algorithm that excites him: it can be used for both classification and regression problems. The author chose a classification task for this article, as this will be easier for a beginner to learn. Regression will be the application problem in the next, upcoming article.


Explanations of model predictions with live and breakDown packages

arXiv.org Machine Learning

Predictive modelling is a very exciting field with many different applications. Lots of algorithms have been developed in this area. According to many Kaggle competitions (Fogg, 2016), winning solutions are often obtained with elastic tools like random forest, gradient boosting or neural networks. These algorithms have many strengths but also share a major weakness, which is the lack of interpretability of a model structure. A single random forest, an xgboost model or a neural network may be parametrized with thousands of parameters which makes these models hard to understand.


Play with classification of Iris data using gradient boosting

#artificialintelligence

Gradient boosting is one of the most widely used machine learning models in practice, and more and more people use it in Kaggle competitions. Are you interested in seeing how to use a gradient boosting model for classification in SAS Visual Data Mining and Machine Learning? Here I play with the classification of Fisher's Iris flower dataset using gradient boosting, and this may serve as a starting point for those interested in trying the classification models in the SAS Visual Data Mining and Machine Learning product. Fisher's Iris data is a well-known dataset in data mining. Per Wikipedia, Fisher developed a linear discriminant model to distinguish the species from each other by the features provided in the dataset.

