CatBoost vs. LightGBM vs. XGBoost

@machinelearnbot

I recently participated in a Kaggle competition (the WiDS Datathon by Stanford) where I landed in the top 10 using various boosting algorithms. Since then, I have been curious about the inner workings of each model, including parameter tuning and pros and cons, so I decided to write this blog. Despite the recent re-emergence and popularity of neural networks, I am focusing on boosting algorithms because they are still more useful in the regime of limited training data, little training time, and little expertise for parameter tuning. Since XGBoost (often called the GBM killer) has been in the machine learning world longer, with lots of articles dedicated to it, this post will focus more on CatBoost and LightGBM. One key difference lies in how splits are found: LightGBM uses Gradient-based One-Side Sampling (GOSS) to filter the data instances used when searching for a split value, while XGBoost uses pre-sorted and histogram-based algorithms to compute the best split.
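As a quick side-by-side illustration, here is a minimal sketch that fits all three libraries on the same synthetic data. The dataset and hyperparameter values are placeholders rather than tuned competition settings, and the GOSS comment reflects how recent LightGBM releases expose that option.

```python
# Minimal sketch: fitting XGBoost, LightGBM, and CatBoost on the same data.
# Hyperparameters are illustrative defaults, not tuned values.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    # XGBoost: histogram-based split finding via tree_method="hist"
    "xgboost": XGBClassifier(n_estimators=200, max_depth=6,
                             learning_rate=0.1, tree_method="hist"),
    # LightGBM: GOSS can be enabled with boosting_type="goss" (older releases)
    # or data_sample_strategy="goss" (LightGBM 4.x); default gbdt is used here
    "lightgbm": LGBMClassifier(n_estimators=200, num_leaves=31,
                               learning_rate=0.1),
    # CatBoost: pass cat_features=... to fit() when columns are categorical
    "catboost": CatBoostClassifier(iterations=200, depth=6,
                                   learning_rate=0.1, verbose=False),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```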


Creatures that churned out giant sperm may have died out because their sexual organs were TOO LARGE

Daily Mail - Science & tech

The extravagant lengths that males will go to to bed a member of the opposite sex can come at the risk of extinction for some species, researchers said Wednesday. Sexual selection, how animals attract and choose mates, can be thanked for showy traits such as the elegant peacock's tail, the grand antlers of a stag, and the bushy mane of a lion. A study published in the science journal Nature, however, found that some creatures take it too far. US-based scientists analysed the fossils of thousands of ancient crustaceans called ostracods, tiny clam-shaped critters that have been on the planet for nearly 500 million years. The researchers looked at 93 species of extinct ostracods that lived about 85 million to 65 million years ago, during the late Cretaceous period.


Central Clustering of Categorical Data with Automated Feature Weighting

AAAI Conferences

The ability to cluster high-dimensional categorical data is essential for many machine learning applications such as bioinformatics. Currently, central clustering of categorical data is a difficult problem due to the lack of a geometrically interpretable definition of a cluster center. In this paper, we propose a novel kernel-density-based definition using a Bayes-type probability estimator. Then, a new algorithm called k-centers is proposed for central clustering of categorical data, incorporating a new feature weighting scheme by which each attribute is automatically assigned a weight measuring its individual contribution to the clusters. Experimental results on real-world data show outstanding performance of the proposed algorithm, especially in recognizing the biological patterns in DNA sequences.
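To make the idea of attribute-weighted central clustering of categorical data concrete, here is a hedged sketch. It is not the paper's k-centers algorithm or its Bayes-type kernel-density estimator; it is a plain k-modes-style loop where user-supplied per-attribute weights influence the distance, included only to illustrate the general notion of weighting attributes during clustering.

```python
# Generic sketch (assumption): weighted k-modes-style central clustering.
import numpy as np

def weighted_kmodes(X, k, weights, n_iter=20, rng=None):
    """X: (n, d) array of categorical codes; weights: (d,) attribute weights."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # weighted Hamming distance from every point to every center
        dist = np.stack([((X != c) * weights).sum(axis=1) for c in centers])
        labels = dist.argmin(axis=0)
        # update each center attribute to the mode of its cluster members
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            for a in range(d):
                vals, counts = np.unique(members[:, a], return_counts=True)
                centers[j, a] = vals[counts.argmax()]
    return labels, centers

# toy usage: 3 categorical attributes encoded as small integers
X = np.array([[0, 1, 2], [0, 1, 2], [1, 0, 2], [2, 2, 0], [2, 2, 1]])
labels, centers = weighted_kmodes(X, k=2, weights=np.array([1.0, 1.0, 0.5]), rng=0)
print(labels)
```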


CRAFT: ClusteR-specific Assorted Feature selecTion

arXiv.org Machine Learning

We present a framework for clustering with cluster-specific feature selection. The framework, CRAFT, is derived from asymptotic log posterior formulations of nonparametric MAP-based clustering models. CRAFT handles assorted data, i.e., both numeric and categorical data, and the underlying objective functions are intuitively appealing. The resulting algorithm is simple to implement and scales nicely, requires minimal parameter tuning, obviates the need to specify the number of clusters a priori, and compares favorably with other methods on real datasets.


Heart of Darkness: Logistic Regression vs. Random Forest

#artificialintelligence

The 'functional needs repair' category of the target variable only makes up about 7% of the whole set. The implication is that whatever algorithm you end up using, it's probably going to learn the other two, better-balanced classes a lot better than this one. Such is data science: the struggle is real. The first thing we're going to do is create an 'age' variable for the waterpoints, as that seems highly relevant. The 'population' variable also has a highly right-skewed distribution, so we're going to change that as well. One of the most important points we learned the week before, and something that will stay with me, is the idea of coming up with a baseline model as fast as one can.
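A minimal sketch of the two transformations described above, assuming a pandas DataFrame with 'date_recorded', 'construction_year', and 'population' columns; the column names and the log transform are assumptions, not details confirmed by the post.

```python
# Hedged sketch: derive an 'age' feature and tame the skewed 'population'.
import numpy as np
import pandas as pd

def add_age_and_log_population(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # age of the waterpoint: years between recording and construction;
    # a construction_year of 0 is treated as missing (assumption)
    recorded_year = pd.to_datetime(df["date_recorded"]).dt.year
    built_year = df["construction_year"].replace(0, np.nan)
    df["age"] = recorded_year - built_year
    # log1p compresses the right-skewed population counts and handles zeros
    df["log_population"] = np.log1p(df["population"])
    return df

# usage: df = add_age_and_log_population(df)
```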