Chathura Herath 's Blog

Location: Bloomington, Indiana, United States

Friday, April 09, 2010

Graham and Brent theorem for speedup when using multiprocessors or multicores

Since we are big into multicores these days, it would be useful to predict, or at least make a conservative estimate of, how an application speeds up as we throw more and more processors at it. Let us consider the two obvious extreme cases, one favorable and the other unfavorable: embarrassingly parallel applications and strictly sequential applications, respectively. We will look at an example of each, in reverse order.

Strictly sequential

Let's consider calculating the famous nth Fibonacci number, and let's assume we do NOT know how to solve a second order recurrence the clever mathematical way, so we will just use the recurrence as it is. Starting from F0 and F1 we can calculate F2 and so forth up to the nth number. If we write a program to calculate Fi, we observe that it has a strict data dependency on Fi-1 and Fi-2. In such data dependent cases there is no room for making things parallel. Such an application obviously won't speed up as you throw more processors at the problem.


Fn+2 = Fn+1 + Fn
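As a small sketch of the dependency chain (the class and method names here are my own, and a long overflows for large n), the iterative computation makes the sequential nature explicit: each step needs the two previous results before it can run, so the loop iterations cannot be handed to different processors:

```java
public class Fib {
    // Each iteration depends on the results of the previous two,
    // so the loop is inherently sequential.
    static long fib(int n) {
        long prev = 0, curr = 1; // F0 and F1
        for (int i = 2; i <= n; i++) {
            long next = curr + prev; // F(i) = F(i-1) + F(i-2)
            prev = curr;
            curr = next;
        }
        return n == 0 ? 0 : curr;
    }

    public static void main(String[] args) {
        System.out.println(fib(10)); // prints 55
    }
}
```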


Embarrassingly parallel

Consider an application that requires adding two large matrices. The addition of two elements from the matrices is totally independent of any other element. So there is no data dependency in such a case, and the matrices can easily be split into sub-problems, given to however many processors are available, and made to run in parallel. In such applications programs can achieve linear speedup, and in some cases, with some clever caching techniques, even super-linear speedups can be achieved.
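A minimal sketch of this idea (the class name, thread count, and row-slicing scheme are arbitrary choices of mine): split the rows of the result among a fixed pool of threads, each adding its slice independently of the others:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelMatrixAdd {
    // Adds a and b into c, splitting rows across 'threads' workers.
    // No element of c depends on any other, so the slices never interact.
    static void add(double[][] a, double[][] b, double[][] c, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int rows = a.length;
        for (int t = 0; t < threads; t++) {
            final int start = t * rows / threads;
            final int end = (t + 1) * rows / threads;
            pool.execute(() -> {
                for (int i = start; i < end; i++)
                    for (int j = 0; j < a[i].length; j++)
                        c[i][j] = a[i][j] + b[i][j];
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```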

In the two cases discussed above we can calculate how an application may speed up at the boundaries, but most applications we encounter lie somewhere between strictly sequential and embarrassingly parallel. Most applications have a certain degree of parallelism and some amount of data dependency that forces sequential behavior, and analyzing the speedup of such applications is much harder. In this post we have a look at a clever theorem arrived at by Graham [1] and Brent [2] independently; one of the MIT talks I listened to consolidated their ideas into a very insightful result.

First, some naming conventions.

T_n - Time taken for the application to run when n processors are at its disposal. T_1 represents the sequential case, and T_infinity represents the case where the application may have as many processors as needed.

Based on the above conventions we can define:

Parallelism = T_1/T_infinity
Speedup for p processors = T_1/T_p

An observation assuming no cache tricks to achieve super-linear speedup:

T_p > T_1/p

This simply means that by having two processors you cannot cut the time exactly in half; it will always be a little more, due to possible data dependencies and of course the extra overhead associated with making the application parallel.


Model

This theorem models a given application as a graph where nodes represent computational units and edges represent data dependencies, similar to the workflow idea. Graham and Brent argue that if you have infinitely many processors available, the running time depends on the critical path of the graph, i.e. the longest path from start to finish. So T_infinity may also be looked upon as the critical path time. They argue that the achievable speedup depends on this critical path as well as the number of processors available to the scheduler during execution. Based on these two parameters they propose an upper bound on the time taken by an application when it is presented with p processors. An upper bound on time is useful because it allows you to provision your resources for the worst-case scenario.

Theorem:

T_p <= T_1/p + T_infinity


The theorem means that an application with p processors takes, at most, the sequential time divided by p plus the critical path time. Many techniques have since been developed to tighten that upper bound, but in my view this is the most insightful theorem of its nature, and its simplicity makes it quite appealing when making a conservative estimate or during formal analysis.
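As a sketch of how the bound might be applied (the work and critical-path numbers below are invented for illustration): an application with T_1 = 1000 units of total work and a critical path of T_infinity = 50 units, given 8 processors, is guaranteed to finish within 1000/8 + 50 = 175 units, however the scheduler interleaves the work:

```java
public class BrentBound {
    // Upper bound on parallel time from the Graham/Brent theorem:
    // T_p <= T_1 / p + T_infinity
    static double bound(double t1, double tInf, int p) {
        return t1 / p + tInf;
    }

    public static void main(String[] args) {
        // Hypothetical application: 1000 units of total work,
        // critical path of 50 units, run on 8 processors.
        System.out.println(bound(1000, 50, 8)); // prints 175.0
    }
}
```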




[1] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.
[2] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.


Wednesday, March 31, 2010

Optimal Map tasks for Map Reduce Applications based on Time Complexity ???

Some Analysis

One of the strong points of Hadoop is its ability to efficiently partition the input files, using API hooks in HDFS, and call the Map tasks with each of these partitions. It's useful to understand how to find this task distribution to get the best performance achievable. There are a few factors to consider when trying to pick the right number of map tasks, but the obvious one is to strike a balance between the speedup from distribution and the overhead of distribution.

Consider the case where the time complexity of the application is O(n), meaning the running time depends entirely on the size of the input, as in WordCount. We can argue that the partition sizes the system can handle efficiently are the deciding factor, because the running time only grows linearly with the input, and you have to read the input anyway. In such cases, letting Hadoop decide on the partition sizes is the right approach, because it bases the partition size on the optimal block size. Why? The following may convince you.

Let's assume the Hadoop distribution overhead is linear in the number of Map tasks; then the total time looks like the following equation. This is an oversimplified equation, because it leaves out the reduce complexity and assumes, without argument, that Hadoop's overhead is linear in the number of Map tasks. But the simplicity gives more insight.

Total time = O (Input Size/Map Tasks) + Overhead * Map Tasks

So if the application has linear time complexity O(n), then:

Total time = (Input Size/Map Tasks)+ Overhead * Map Tasks

Taking the derivative with respect to Map Tasks and setting it to zero (-Input Size/Map Tasks^2 + Overhead = 0) tells you that Total time is minimized when

Map Tasks = sqrt(Input Size/Overhead)

Since this number depends heavily on the Overhead, it makes sense to let Hadoop do the partitioning and distribution.
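To sanity-check the formula above (the input size and overhead values are hypothetical, in arbitrary units), one can compare the total time at the computed optimum against other map-task counts under the same simplified model:

```java
public class MapTaskOptimum {
    // Simplified model from the text:
    // Total time = Input Size / Map Tasks + Overhead * Map Tasks
    static double totalTime(double inputSize, double overhead, double mapTasks) {
        return inputSize / mapTasks + overhead * mapTasks;
    }

    // The minimizer derived by setting the derivative to zero.
    static double optimalMapTasks(double inputSize, double overhead) {
        return Math.sqrt(inputSize / overhead);
    }

    public static void main(String[] args) {
        double inputSize = 1_000_000, overhead = 10; // hypothetical units
        double opt = optimalMapTasks(inputSize, overhead); // about 316.2
        System.out.println(opt);
        // The optimum beats both a smaller and a larger task count:
        System.out.println(totalTime(inputSize, overhead, opt)
                <= totalTime(inputSize, overhead, 100));
        System.out.println(totalTime(inputSize, overhead, opt)
                <= totalTime(inputSize, overhead, 1000));
    }
}
```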



Structured input for Map Reduce

Consider the case where the time complexity of the application is not linear but higher; then the dominant term in the Total time is the compute term O(Input Size/Map Tasks). If the time complexity is O(n^2), then:


Total time = (Input Size/Map Tasks) ^ 2 + Overhead * Map Tasks


Here the first term grows much faster, so minimizing it by increasing the number of Map tasks is essential. In such cases you cannot let Hadoop come up with the number of Map tasks based on the optimal block sizes it can handle; rather, you need to force the number of map tasks upon it.

Take, for example, a Java ray tracing application, which has O(n^2) complexity according to Michael. Here you try to render a scene using ray tracing, and this can be easily parallelized because each pixel calculation is independent of the others. When you parallelize it, you will have different segments of the input file you want to compute, and you need to force Map Reduce to calculate those segments in different map tasks. So you need more structured input than in the word count example, because the number of map tasks and the running time depend on it. Letting the partitioning be done to optimize file sizes in such a compute-intensive application would be unwise and in many cases counterproductive. Although it should be noted that Hadoop was designed for data-intensive parallelism rather than compute-intensive parallelism.


Forcing Hadoop to run a map task for each line in input file

Hadoop provides an input formatter called NLineInputFormat, where each line in the input file gets mapped to a separate map task, which gives you control over the number of map tasks you want to force upon the Hadoop framework. Following is the (incomplete) code to set up the JobConf; the setInputFormat call is what sets NLineInputFormat. So if you know the optimal number of map tasks you need to run, you can write those inputs as lines in the input file, and Hadoop will create a new map task for each one.


import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf conf = new JobConf(HadoopRayTracer.class);
conf.setJobName("raytrace");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);

// Each line of the input file becomes its own map task:
conf.setInputFormat(NLineInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);
.......
