First Steps in Machine Learning with Apache Spark
Apache Spark is one of the main tools for data processing and analysis in the Big Data context. It's a very complete (and complex) data processing framework, with functionality that can be roughly divided into four groups: Spark SQL & DataFrames, for general-purpose data processing; Spark Structured Streaming, for handling data streams; Spark MLlib, for machine learning and data science; and GraphX, the graph processing API. I've already featured the first two in other posts: creating an ETL process for a Data Warehouse and integrating Spark and Kafka for stream processing. Today it's time for the third one: let's play with machine learning using Spark MLlib. Machine Learning has a special place in my heart, because it was my entrance door to the data science field and, like probably many of you, I started with the classic Scikit-Learn library.
Jan-5-2023, 19:05:21 GMT