Executing data-intensive analysis workflows in a multi-cloud environment, such as the Worldwide LHC Computing Grid (WLCG) at CERN, requires a large amount of input data that is stored across multiple storage elements. The turnaround time of an individual analysis workflow running on an edge machine is dominated by the time spent reading this data, so reducing the data reading time improves the overall efficiency of the analysis process. To address this, we use Speculative Scheduling to optimize multi-cloud analysis workflows by intelligently streaming data to the edge machine before a task arrives there for execution. We propose an Event System (ES), an in-memory Serverless process responsible for proactively providing input data to the workflow processes. It prefetches data from the storage elements into the memory of the edge machine that executes the workflow. Using locality-aware scheduling and prefetching algorithms, it performs Speculative Scheduling based on an evaluation of historic execution logs with a Bayesian inference model. The Serverless ES learns about incoming jobs ahead of time and uses intelligent data streaming to supply data to them, which reduces the scheduling and data access latencies and leads to significant improvements in the overall turnaround time. We evaluated the proposed system with a large analysis workflow from High Energy Physics (HEP) by emulating the WLCG infrastructure in a controlled environment. The results show that speculative and locality-aware scheduling techniques can achieve significant improvements (over 30%) in the execution of data-intensive workflows in the cloud environment.
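The abstract does not give the form of the Bayesian model, so the following is only a minimal sketch of the general idea, under stated assumptions: historic execution logs are reduced to (task type, dataset) pairs, the predictor uses a symmetric Dirichlet prior over datasets, and a dataset is prefetched when its posterior probability reaches a threshold. The class name `SpeculativePrefetcher`, the parameters `alpha` and `threshold`, and the dataset labels are hypothetical and are not taken from the Event System itself.

```python
from collections import Counter, defaultdict


class SpeculativePrefetcher:
    """Toy predictor of P(dataset | task type) from historic logs.

    Assumption: a Dirichlet-multinomial model with a symmetric prior;
    the actual model used by the Event System is not described here.
    """

    def __init__(self, datasets, alpha=1.0):
        self.datasets = list(datasets)       # known datasets (hypothetical labels)
        self.alpha = alpha                   # prior pseudo-count per dataset
        self.counts = defaultdict(Counter)   # task type -> dataset -> accesses

    def observe(self, task_type, dataset):
        """Record one historic access of `dataset` by a task of `task_type`."""
        self.counts[task_type][dataset] += 1

    def posterior(self, task_type):
        """Posterior mean probability of each dataset for this task type."""
        seen = self.counts[task_type]
        total = sum(seen.values()) + self.alpha * len(self.datasets)
        return {d: (seen[d] + self.alpha) / total for d in self.datasets}

    def to_prefetch(self, task_type, threshold=0.5):
        """Datasets worth streaming to the edge machine ahead of the task."""
        post = self.posterior(task_type)
        likely = [d for d, p in post.items() if p >= threshold]
        return sorted(likely, key=lambda d: -post[d])


# Usage with made-up log entries.
history = [("skim", "A"), ("skim", "A"), ("skim", "B"), ("skim", "A"), ("fit", "C")]
prefetcher = SpeculativePrefetcher(["A", "B", "C"])
for task_type, dataset in history:
    prefetcher.observe(task_type, dataset)

print(prefetcher.to_prefetch("skim"))
```

With a prior pseudo-count of 1, a task type that has never been seen gets a uniform posterior, so nothing crosses the threshold and no data is moved speculatively. Locality-aware scheduling would additionally weigh where each dataset is stored, which this sketch leaves out.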
Author affiliation
College of Science & Engineering
Comp' & Math' Sciences