Regression in Python using Sklearn, XGBoost and PySpark


In the story above, we used a Fitbit dataset. The EDA showed that steps taken and calories are roughly linearly correlated, and that together they may indicate a lower risk of all-cause mortality. More interestingly, one dataset in our data has not been used yet: the weight and BMI log. These data are different in nature because they are not necessarily machine generated, so they can serve as 'labels'. Put simply, users collect activity data with their Fitbit and, once in a while, log body information such as weight, fat and BMI.
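To make the idea of 'labels' concrete, here is a minimal pandas sketch of how occasional weight logs could be attached to daily activity records. The column names (`Id`, `Date`, `TotalSteps`, `Calories`, `WeightKg`, `BMI`) and all values are assumptions made up for illustration; they are not taken from the actual dataset.

```python
import pandas as pd

# Toy, made-up values for illustration only; not real Fitbit data.
# Column names are assumptions and may differ from the actual files.
activity = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12", "2016-04-13"]
    ),
    "TotalSteps": [4000, 8000, 12000, 3000, 9000],
    "Calories": [1800, 2100, 2500, 1700, 2300],
})

# Users log body information only once in a while, so this table is sparse.
weight_log = pd.DataFrame({
    "Id": [1, 2],
    "Date": pd.to_datetime(["2016-04-13", "2016-04-13"]),
    "WeightKg": [70.0, 85.5],
    "BMI": [22.9, 27.4],
})

# Pearson correlation between steps and calories on the toy data.
steps_calories_corr = activity["TotalSteps"].corr(activity["Calories"])

# A left join keeps every activity row; days without a log entry get NaN.
labeled = activity.merge(weight_log, on=["Id", "Date"], how="left")

# Only the rows that carry a label are usable as supervised training examples.
labeled_only = labeled.dropna(subset=["WeightKg"])

print(f"steps/calories correlation: {steps_calories_corr:.3f}")
print(labeled_only)
```

The left join is a deliberate choice here: it makes visible how few activity days actually come with a label, which matters when deciding how much data a regression model will have to learn from.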

