Spark Technology Center

#artificialintelligence 

One of the main goals of the machine learning team here at the Spark Technology Center is to continue to evolve Apache Spark as the foundation for end-to-end, continuous, intelligent enterprise applications. While working on adding multi-class logistic regression to Spark ML (part of the ongoing push towards parity between ml and mllib), STC team member Seth Hendrickson realized that, due to the way that Spark automatically serializes data when inter-node communication is required (e.g. during a reduce or aggregation operation), the aggregation step of the logistic regression training algorithm resulted in 3x more data being communicated than necessary. What does it mean when we refer to Apache Spark as the "foundation for end-to-end, continuous, intelligent enterprise applications"?