Integrating Apache Spark and HANA


Although the volume of data sitting in HDFS has grown enormously and continues to grow exponentially, much of it has been contextual data (e.g., social data, click-stream data, sensor data, logs, and third-party data sources) and historical data. Real-time operational data (e.g., data from foundational enterprise applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and SCM (Supply Chain and Inventory Management) systems) has historically been maintained separately, and moving data in either direction to enable analytics across the combined data set is cumbersome at best.

A similar mechanism works for HANA users, where TGFs (Table Generating Functions) and custom UDFs (User-Defined Functions) provide access to the full breadth of Spark's capabilities through the Smart Data Access functionality.

That's why they've been adamant that any SAP Spark distribution be a Certified Spark Distribution, and hence capable of supporting the rapidly growing set of "Certified on Spark" applications and the development ecosystem.