Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber

XGBoost on Apache Spark treats zero values in a SparseVector as missing values (the missing-value marker defaults to Float.NaN), whereas zeros in a DenseVector are treated as ordinary zeros. Apache Spark ML implicitly optimizes vector storage, so the same array may be stored as a SparseVector or a DenseVector depending on which is more space-efficient. If an ML practitioner feeds a DenseVector at inference time to a model trained on SparseVectors, or vice versa, XGBoost does not issue any warning, and the input will likely be routed down unexpected branches because of the way zeros are stored, resulting in inconsistent predictions. Hence, it is critical that the input storage structure remains consistent between training and serving.
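The effect can be illustrated with a minimal pure-Python sketch. It does not call XGBoost or Spark; the one-split tree, the lookup helpers, and all names (`predict_stump`, `dense_lookup`, `sparse_lookup`) are hypothetical and exist only to show how the same row can land in different branches depending on how its zeros are stored.

```python
import math

# Hypothetical one-split tree (not the XGBoost API): go left when the
# feature is below the threshold; missing values follow a default direction.
def predict_stump(value, threshold=0.5, default_left=False,
                  left_leaf=-1.0, right_leaf=1.0):
    if math.isnan(value):  # missing value takes the default branch
        return left_leaf if default_left else right_leaf
    return left_leaf if value < threshold else right_leaf

def dense_lookup(values, index):
    # Dense storage keeps explicit zeros, so a zero is read back as 0.0.
    return values[index]

def sparse_lookup(indices, values, index):
    # Sparse storage keeps only non-zero entries. An absent index is
    # reported as NaN (missing), mirroring the behaviour described above.
    for i, v in zip(indices, values):
        if i == index:
            return v
    return float("nan")

row = [0.0, 3.0]  # feature 0 is a genuine zero

# Dense path: the zero is compared against the threshold.
dense_pred = predict_stump(dense_lookup(row, 0))

# Sparse path: the zero is dropped from storage and read back as missing.
sparse_indices = [i for i, v in enumerate(row) if v != 0.0]
sparse_values = [v for v in row if v != 0.0]
sparse_pred = predict_stump(sparse_lookup(sparse_indices, sparse_values, 0))

print(dense_pred, sparse_pred)
```

In this sketch the dense row takes the left branch while the sparse row takes the default (right) branch, even though both encode the same feature values. One way to avoid the mismatch, under the assumption that you control the feature pipeline, is to convert every vector to a single representation before both training and serving.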
