How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part 2
The first post in this series described some important preliminary data science tasks performed for the competition, such as exploratory data analysis and feature engineering, using a Spark cluster deployed on Google Dataproc. It was necessary to separate a validation set from clicks_train.csv. Machine learning models were trained on the train set, and their accuracy was evaluated on the validation set by comparing the predictions with the ground truth labels (clicks). As we optimize cross-validation (CV) model accuracy by testing different feature engineering approaches, algorithms and hyperparameter tuning, we expect our score on the competition Leaderboard (LB), which is computed on the test set, to improve accordingly. The categorical fields whose average CTR yielded higher predictive accuracy on the CV score were ad_document_id, ad_source_id, ad_publisher_id, ad_advertiser_id, ad_campain_id, document attributes (category_ids, topics_ids, entities_ids) and their combinations with event_country_id, which modeled regional user preferences.
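To make the average-CTR idea concrete, here is a minimal sketch in pandas rather than Spark. It assumes the label column is named `clicked`, uses tiny made-up data, and falls back to the overall train CTR for category values never seen in training. The helper name `add_avg_ctr` and the fallback rule are illustrative assumptions, not the approach confirmed by the post.

```python
import pandas as pd


def add_avg_ctr(train, valid, keys):
    """Attach the average CTR of `keys`, computed on train rows only.

    Hypothetical helper: the `clicked` column name and the global-CTR
    fallback for unseen values are assumptions made for this sketch.
    """
    name = "avg_ctr_" + "_".join(keys)
    # Mean of the 0/1 label per category value is the average CTR.
    ctr = train.groupby(keys)["clicked"].mean().rename(name).reset_index()
    # A left join keeps every validation row, in its original order.
    out = valid.merge(ctr, on=keys, how="left")
    # Values unseen in train get the overall train CTR.
    out[name] = out[name].fillna(train["clicked"].mean())
    return out


# Toy data standing in for the train and validation splits.
train = pd.DataFrame({
    "ad_advertiser_id": [1, 1, 1, 2, 2, 2],
    "event_country_id": ["US", "US", "BR", "US", "BR", "BR"],
    "clicked": [1, 0, 1, 0, 0, 1],
})
valid = pd.DataFrame({
    "ad_advertiser_id": [1, 2, 3],
    "event_country_id": ["US", "BR", "US"],
})

# Single field, then its combination with the event country.
valid = add_avg_ctr(train, valid, ["ad_advertiser_id"])
valid = add_avg_ctr(train, valid, ["ad_advertiser_id", "event_country_id"])
print(valid)
```

Computing the statistics on train rows only keeps validation labels out of the feature, so the CV score remains a fair estimate of how the feature would behave on the test set.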
Jul-1-2017, 23:40:11 GMT