Spark tutorial: Get started with Apache Spark


Apache Spark has become the de facto standard for processing data at scale, whether for querying large datasets, training machine learning models to predict future trends, or processing streaming data. In this article, we'll show you how to use Apache Spark to analyze data in both Python and Spark SQL. We'll also extend our code to support Structured Streaming, the current state of the art for handling streaming data within the platform. We'll be using Apache Spark 2.2.0 here, but the code in this tutorial should also work on Spark 2.1.0.

Hands on Introduction to Apache SPARK with IBM DSX and Message Hub on IBM Cloud


This IBM-sponsored Proof of Technology (PoT) workshop will give clients a full day of education on Apache Spark, IBM Data Science Experience (DSX), and Message Hub, an Apache Kafka-based service on IBM Cloud. The PoT will include a detailed overview of Apache Spark, IBM DSX, and Message Hub, along with hands-on exercises.

Data pipelines on Spark and Kubernetes


If you're running data pipelines and workflows to move data into a data lake, your team will usually need to process huge amounts of data. To do this scalably and cost-effectively across complex computation steps, Kubernetes is a great choice for scheduling Spark jobs, compared to YARN. Apache Spark is a framework that can quickly perform processing tasks on very large data sets, and Kubernetes is a portable, extensible, open-source platform for managing and orchestrating the execution of containerized workloads and services across a cluster of machines. From an architectural perspective, when you submit a Spark application you interact directly with the Kubernetes API server, which schedules the driver pod -- the container running the Spark driver. The Spark driver and the Kubernetes cluster then talk to each other to request and launch Spark executors.
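With Spark's native Kubernetes scheduler (available since Spark 2.3), the submission flow described above can be sketched from Python; the API server address, container image, and executor count below are placeholders, not values from the article:

```python
from pyspark.sql import SparkSession

# Sketch only: the k8s:// master URL tells Spark to talk to the Kubernetes
# API server, which schedules the driver pod; the driver then requests
# executor pods from the cluster. Endpoint, image, and counts are placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://k8s-apiserver.example.com:6443")
    .appName("pipeline-job")
    .config("spark.kubernetes.container.image", "example/spark:latest")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```

In practice these settings are more often passed to `spark-submit` as `--master k8s://...` and `--conf` flags; either way, the driver and the Kubernetes cluster negotiate executor launch through the API server, as described above.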

Cisco's Spark Board looks like an iPad -- and acts like one


The Spark Board meeting device that Cisco Systems introduced on Tuesday is not so much a whiteboard or a videoconferencing screen as a giant tablet that everyone in the room can share. There's even a "home" button in the center of the bottom bezel that takes you back to the main menu. If Apple didn't have a partnership with Cisco, you might even expect it to accuse the networking giant of copying its iPad design. But Apple and Cisco are in fact working together, so closely that iPhones can work with the Spark Board a little more smoothly than other phones do. And in developing the new all-in-one device, Cisco focused on simplicity and ease of use, which haven't exactly been hallmarks of the networking giant's products up to now.

Running Peta-Scale Spark Jobs on Object Storage Using S3 Select


When you look at the amazing roster of talks for most data science conferences, what you don't see is much discussion of how to leverage object storage. On some level you would expect to: if you want to run your Spark or Presto jobs on peta-scale data sets and have the results available to your applications in the public or private cloud, object storage is the logical storage architecture. While logical, there has historically been a catch: object storage wasn't performant enough to actually make running Spark jobs feasible. With the advent of modern, cloud-native approaches, that changes, and the implications for the Apache Spark community are pretty significant. At the heart of this change is the extension of the S3 API to include SQL query capabilities: S3 Select.
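Outside of Spark itself, the S3 Select API call looks like the following boto3 sketch; the bucket, key, and column names are hypothetical, and running it requires real AWS (or S3-compatible) credentials and data:

```python
import boto3

# Hypothetical bucket, key, and schema. S3 Select pushes the SQL filter down
# to the storage layer, so only matching rows cross the network instead of
# the whole object.
s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="example-data-lake",
    Key="events/2018/clicks.csv",
    ExpressionType="SQL",
    Expression="SELECT s.user_id FROM s3object s WHERE s.country = 'US'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response payload is an event stream; Records events carry the
# filtered result bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```

The performance argument in the article follows from this pushdown: a Spark job that filters at the storage layer reads only the selected rows and columns, rather than pulling entire peta-scale objects to the executors.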