Improve performance of ML pipelines for wide DataFrames in Apache Spark 2.3

Apache Spark MLlib's DataFrame-based API provides a simple yet flexible and elegant framework for creating end-to-end machine learning pipelines. Building on Spark's DataFrames and SQL engine, Spark ML pipelines make it easy to link together the phases of the machine learning workflow, from data processing, through feature extraction and engineering, to model training and evaluation. However, while Spark SQL can deliver significant performance gains in some parts of the ML workflow, other areas have important shortcomings. One is that many of the most commonly used Spark ML components operate on a single column at a time. This particularly affects the common use case of "wide" datasets, which have many variables or features that typically need to be processed in the same manner (for example, encoding many categorical feature columns or discretizing many numerical feature columns).
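
To make the single-column pattern concrete, here is a toy model in plain Python. It is not Spark code and does not use Spark's API: `SingleColumnBucketizer`, `build_stages`, and `run_pipeline` are hypothetical names invented for this sketch, and the model ignores anything Spark's optimizer might do. It only illustrates the shape of the problem: when a transformer handles one column, a dataset with N feature columns needs N pipeline stages.

```python
from bisect import bisect_right


class SingleColumnBucketizer:
    """Hypothetical stand-in for a transformer that handles exactly one column."""

    def __init__(self, input_col, output_col, cut_points):
        self.input_col = input_col
        self.output_col = output_col
        self.cut_points = sorted(cut_points)

    def transform(self, rows):
        # In this toy model, each stage makes one full pass over the rows
        # and adds a single bucketed output column.
        return [
            {**row, self.output_col: bisect_right(self.cut_points, row[self.input_col])}
            for row in rows
        ]


def build_stages(columns, cut_points):
    # A wide dataset needs one stage per feature column.
    return [SingleColumnBucketizer(c, c + "_bucket", cut_points) for c in columns]


def run_pipeline(stages, rows):
    # Stages run in sequence, so N columns mean N passes in this model.
    for stage in stages:
        rows = stage.transform(rows)
    return rows


# A small "wide" dataset: 100 numeric feature columns, 3 rows.
columns = ["f%d" % i for i in range(100)]
rows = [{c: float(r) for c in columns} for r in range(3)]

stages = build_stages(columns, cut_points=[0.5, 1.5])
result = run_pipeline(stages, rows)

print("stages needed:", len(stages))
print("columns per row after the pipeline:", len(result[0]))
```

Every stage here applies the same logic with the same cut points, which is exactly the repetition that makes wide datasets awkward to handle one column at a time.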