|Title||:||Chunking Loops with Irregular Workloads|
|Speaker||:||Indu K (IITM)|
|Details||:||Thu, 23 Mar, 2017 2:00 PM @ BSB 361|
|Abstract:||:||Task parallel languages such as X10 implement dynamic light-weight task parallel execution model, where programmers are encouraged to express the ideal parallelism in the program. Loop chunking has been used as a technique to extract useful parallelism from ideal. Traditional loop chunking techniques assume that iterations in the loop are of similar workload or the behavior of the first few iterations can be used to predict the load in later iterations. However, in loops with non-uniform work distribution, such assumptions do not hold. This problem becomes more complicated in the presence of atomic blocks (critical sections).
In this paper, we propose a new optimization (deep-chunking) that chunks the iterations of the parallel-for-loops, based on the run-time workload of each iteration. We do so by studying the structure of each iteration and formulating a cost-expression to identify the relative workload of the iterations at run-time. The cost-expressions are formulated at compile time and are evaluated by individual threads at run-time. We propose a parallel algorithm that is executed by individual threads to compute their respective chunks, so that the overall execution time gets reduced. We prove that the algorithm is correct and it is a 2-factor approximation. In addition to simple parallel-for-loops, the proposed deep-chunking can also handle loops with atomic blocks which lead to interesting challenges. We have implemented deep-chunking in the x10v2.4.3 compiler and studied its performance on the benchmarks taken from IMSuite. We show that on an average, deep- chunking achieves 50.48%, 21.49%, 26.72%, 32.41%, and 28.84% better performance than un-chunked, cyclic-, block-, dynamic-, and guided-chunking versions of the code, respectively.