Many software systems use the collection-processing APIs of functional languages (such as Scala collections) for their data-analytics components. These systems fail to handle large amounts of data mainly for two reasons: 1) the data does not fit into memory, and 2) the computation is performed too slowly.
Apache Spark is a fast and general engine for large-scale data processing. Spark exposes an API similar to Scala collections. Instead of the notion of local collections, Spark has the notion of RDDs (Resilient Distributed Datasets). In essence, an RDD is a distributed collection whose operations are transformations executed on different worker nodes.
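The resemblance between the two APIs can be illustrated with a minimal sketch. The following toy class (in Python rather than Scala, purely for illustration; the class and its partition layout are hypothetical, not Spark's actual implementation) stores data as a list of partitions and offers `map` and `filter` with the same shape as local collection operations:

```python
# Minimal sketch of an RDD-like distributed collection (hypothetical,
# for illustration only -- not Spark's actual implementation).

class ToyRDD:
    def __init__(self, partitions):
        # Each partition is a plain list; in Spark these would live
        # on different worker nodes.
        self.partitions = partitions

    def map(self, f):
        # A transformation: applied independently to each partition.
        return ToyRDD([[f(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return ToyRDD([[x for x in p if pred(x)] for p in self.partitions])

    def collect(self):
        # An action: gather the elements of all partitions at the driver.
        return [x for p in self.partitions for x in p]

# The same pipeline shape works on a local collection and on the toy RDD:
local = [x * 2 for x in range(6) if x % 2 == 0]
rdd = ToyRDD([[0, 1, 2], [3, 4, 5]]).filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
assert rdd.collect() == local  # [0, 4, 8]
```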
Although RDD transformations are very similar to Scala collection operations, it is not straightforward to convert programs written using Scala collections into ones using Spark RDDs. There are two main reasons for this: 1) there is not always a one-to-one correspondence between Scala collection operations and Spark RDD transformations, and 2) for different input data sizes, we have to use different algorithms to achieve the best performance. For query processing, we cover different algorithms in Section 4.
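A classic instance of this mismatch is aggregation by key: a local-collection `groupBy` followed by a sum materializes every value per key, while Spark's `reduceByKey` aggregates inside each partition before merging. The sketch below (a Python simulation with hypothetical function names, not actual Spark code) contrasts the two shapes and checks that they compute the same result:

```python
# Sketch of why collection operations and RDD transformations do not
# map one-to-one (hypothetical simulation, not actual Spark code).

def group_by_then_sum(pairs):
    # Local-collection style: groupBy materializes every value per key
    # before reducing -- fine in memory, wasteful when distributed.
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def reduce_by_key(partitions):
    # Spark style: reduce inside each partition first, then merge the
    # small per-partition maps -- the shape of Spark's reduceByKey.
    partials = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = local.get(k, 0) + v
        partials.append(local)
    merged = {}
    for local in partials:
        for k, v in local.items():
            merged[k] = merged.get(k, 0) + v
    return merged

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
parts = [pairs[:2], pairs[2:]]
assert group_by_then_sum(pairs) == reduce_by_key(parts)  # {'a': 4, 'b': 6}
```

Both versions are semantically equivalent, but only the second avoids shipping all raw values across the network, which is why a direct syntactic translation of collection code is rarely the best Spark program.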
In this project, we propose using synthesis techniques to synthesize an appropriate algorithm for the given input data and the given distributed platform. This work is inspired by OCAS. The contributions of this work are summarized as follows:
- Deriving an appropriate cost model and platform model, compatible with OCAS and precise enough for a distributed setting.
- Additional transformation rules for generating programs representing different distributed algorithms for query processing.
- Generating Spark code for the synthesized programs.
- Experimental results for different Spark programs.
Next, we discuss the high-level architecture of our synthesizer.
The input program is given in OCAL (which is very similar to Scala collections). The input program specification, together with a given memory hierarchy specification, is passed to the synthesizer. The synthesizer produces an optimized Spark program in six phases, as shown in Figure 1:

Figure 1: The architecture of the query synthesizer
1. The program specification, together with a specification of the memory hierarchy and information about the input data, is passed into a memory analysis component.
2. The memory analyser ensures that the program is memory-safe, which means:
- No expression exceeds the capacity of its corresponding memory node.
- Transferring data from one memory node to another does not skip levels of the memory hierarchy.
For example, to transfer data from Cache to HDD, the data must go first from Cache to RAM and then from RAM to HDD.
The memory analyser annotates every expression of the given program with its corresponding memory location.
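These two checks can be sketched as follows. The hierarchy ordering, the capacity numbers, and the function names below are hypothetical placeholders chosen for illustration; the real analyser operates on annotated OCAL expressions:

```python
# Sketch of the two memory-safety checks described above (hypothetical
# data structures; the real analyser works on OCAL expressions).

HIERARCHY = ["Cache", "RAM", "HDD"]  # ordered from fastest to slowest
CAPACITY = {"Cache": 2**20, "RAM": 2**33, "HDD": 2**42}  # bytes (illustrative)

def fits(size_bytes, node):
    # Check 1: an expression's result must fit in its memory node.
    return size_bytes <= CAPACITY[node]

def transfer_path(src, dst):
    # Check 2: data moves one hierarchy level at a time, never skipping.
    i, j = HIERARCHY.index(src), HIERARCHY.index(dst)
    if i <= j:
        return HIERARCHY[i:j + 1]
    return list(reversed(HIERARCHY[j:i + 1]))

# A Cache-to-HDD transfer is routed through RAM, as in the example above.
assert transfer_path("Cache", "HDD") == ["Cache", "RAM", "HDD"]
assert fits(1024, "Cache")
```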
3. This annotated program is passed to the synthesizer. The synthesizer uses the given transformation rules, the search algorithm, the input information, and the cost model for the given platform. The synthesizer is built on top of SC (Systems Compiler), an extensible optimizing compiler.
4. The synthesizer applies the given transformation rules and, based on the given search algorithm, explores the space of semantically equivalent programs. Furthermore, based on the given memory hierarchy, the synthesizer drops intermediate programs that are not memory-safe. Based on the given cost model, the synthesizer chooses the program with the best runtime cost. This program is chosen as the optimal OCAL program.
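The exploration in this step can be sketched as a simple search loop. Everything below is a hypothetical simplification (programs as opaque values, rules as functions returning rewrites), not the actual SC-based implementation:

```python
# Sketch of step 4: explore semantically equivalent programs, drop the
# ones that are not memory-safe, and keep the cheapest. All names and
# numbers here are hypothetical placeholders.

def synthesize(program, rules, is_memory_safe, cost):
    # Exhaustive exploration of the rewrite space (a real search
    # algorithm would prune or order this frontier).
    frontier = [program]
    seen = {program}
    while frontier:
        prog = frontier.pop()
        for rule in rules:
            for rewritten in rule(prog):
                # Drop intermediate programs that are not memory-safe.
                if rewritten not in seen and is_memory_safe(rewritten):
                    seen.add(rewritten)
                    frontier.append(rewritten)
    # Choose the memory-safe candidate with the best estimated cost.
    return min((p for p in seen if is_memory_safe(p)), key=cost)

# Toy usage: "programs" are strings, one rule rewrites "groupBy+sum"
# into "reduceByKey", and the cost model prefers the latter.
rules = [lambda p: ["reduceByKey"] if p == "groupBy+sum" else []]
cost = {"groupBy+sum": 10, "reduceByKey": 3}.get
best = synthesize("groupBy+sum", rules, lambda p: True, cost)
assert best == "reduceByKey"
```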
Figure 2: The memory hierarchy specification for a distributed setting
5. The optimal OCAL program is passed to the synthesizer.
6. The synthesizer uses the code generation facilities of SC to generate Spark code for the optimal OCAL program.
Next, we show how we extend OCAS to adapt it to a distributed setting.