Cost Based Optimizer in Apache Spark 2.2

This is a joint engineering effort between Databricks' Apache Spark engineering team (Sameer Agarwal and Wenchen Fan) and Huawei's engineering team (Ron Hu and Zhenhua Wang).

Apache Spark 2.2 recently shipped with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, average/max length) to improve the quality of query execution plans. These statistics help Spark make better decisions when picking the optimal query plan. Examples of these optimizations include selecting the correct build side in a hash-join, choosing the right join type (broadcast hash-join vs. shuffled hash-join), and adjusting a multi-way join order, among others. In this blog, we'll take a deep dive into Spark's Cost Based Optimizer (CBO), discuss how Spark collects and stores these statistics and uses them to optimize queries, and show the CBO's performance impact on TPC-DS benchmark queries.

At its core, Spark's Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them.
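Before looking at how Catalyst's rules consume these statistics, here is a minimal sketch of turning the CBO on and collecting per-column statistics in Spark 2.2. The `customers` table and its columns are made up for illustration; the configuration keys (`spark.sql.cbo.enabled`, `spark.sql.cbo.joinReorder.enabled`) and the `ANALYZE TABLE` / `DESC EXTENDED` commands are the ones Spark 2.2 exposes.

```scala
import org.apache.spark.sql.SparkSession

object CboStatsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cbo-stats-example")
      .master("local[*]")
      .config("spark.sql.cbo.enabled", "true")             // CBO is off by default in 2.2
      .config("spark.sql.cbo.joinReorder.enabled", "true") // let the CBO reorder multi-way joins
      .getOrCreate()

    // A small hypothetical table to analyze.
    spark.range(1000)
      .selectExpr("id AS c_custkey", "CAST(id % 10 AS STRING) AS c_city")
      .write.mode("overwrite").saveAsTable("customers")

    // Table-level statistics: size in bytes and row count.
    spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS")

    // Per-column statistics: distinct values, NULL counts, min/max,
    // average/max length -- the inputs the CBO uses for plan costing.
    spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS c_custkey, c_city")

    // Inspect the stored statistics for one column.
    spark.sql("DESC EXTENDED customers c_custkey").show(truncate = false)

    spark.stop()
  }
}
```

Note that statistics collection is an explicit, user-triggered step: until `ANALYZE TABLE` has been run, the optimizer falls back to coarse size-based estimates.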
