Reasoning on Data Partitioning for Single-Round Multi-Join Evaluation in Massively Parallel Systems

Feb-21-2017, 19:40:20 GMT – Communications of the ACM

Evaluating queries over massive amounts of data is a major challenge in the big data era. Modern massively parallel systems, such as Spark, organize query answering as a sequence of rounds, each consisting of a distinct communication phase followed by a computation phase. The communication phase redistributes data over the available servers, while in the subsequent computation phase each server performs the actual computation on its local data. There is growing interest in single-round algorithms for evaluating multiway joins, where data is first reshuffled over the servers and then evaluated in a parallel but communication-free way. As the amount of communication induced by a reshuffling of the data is a dominating cost in such systems, we introduce a framework for reasoning about data partitioning to detect when we can avoid the data reshuffling step.
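
To make the round structure concrete, here is a minimal sketch of a single-round evaluation of a two-way join R(a, b) ⋈ S(b, c), assuming plain hash partitioning on the join attribute. It is only an illustration of the communication and computation phases described above, not the framework the article introduces; the function names (`reshuffle`, `local_join`, `single_round_join`) and the sample relations are hypothetical, and the "servers" are simulated as in-memory lists.

```python
from collections import defaultdict


def reshuffle(tuples, key_index, num_servers):
    """Communication phase: route each tuple to a server by hashing its join key."""
    servers = [[] for _ in range(num_servers)]
    for t in tuples:
        servers[hash(t[key_index]) % num_servers].append(t)
    return servers


def local_join(r_part, s_part):
    """Computation phase: join R(a, b) with S(b, c) on b using local data only."""
    index = defaultdict(list)
    for b, c in s_part:
        index[b].append(c)
    return [(a, b, c) for a, b in r_part for c in index.get(b, ())]


def single_round_join(r, s, num_servers):
    """One round: reshuffle both relations on b, then join on each server independently."""
    r_parts = reshuffle(r, 1, num_servers)  # b is the second attribute of R
    s_parts = reshuffle(s, 0, num_servers)  # b is the first attribute of S
    result = []
    for r_part, s_part in zip(r_parts, s_parts):
        # No server needs data held by another server at this point.
        result.extend(local_join(r_part, s_part))
    return result


# Hypothetical sample data, chosen only for illustration.
R = [(1, 10), (2, 20), (3, 10)]
S = [(10, "x"), (20, "y"), (30, "z")]
joined = sorted(single_round_join(R, S, num_servers=3))
```

Because tuples that agree on `b` always land on the same server, the union of the local results equals the full join. If the relations were already distributed this way, the `reshuffle` calls would move nothing, which is the kind of situation in which the reshuffling step could be skipped.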

artificial intelligence, natural language, question answering

Technology:
- Information Technology
  - Architecture > Distributed Systems (0.74)
  - Artificial Intelligence > Natural Language
    - Question Answering (0.38)