Since we are big into multicores these days, it would be useful to predict, or at least conservatively estimate, how an application's speedup behaves as we throw more and more processors at it. Let us consider the two obvious extreme cases, one favorable and the other unfavorable: embarrassingly parallel applications and strictly sequential applications, respectively. We will look at an example of each, in reverse order.

Strictly Sequential

Let's consider calculating the famous nth Fibonacci number, and let's assume we do NOT know how to solve a second-order recurrence the clever mathematical way, so we will just use the recurrence as it is. Starting from F0 and F1 we can calculate F2 and so forth, up to the nth number. If we write a program to calculate Fi, we can observe that it has a strict data dependency on Fi-1 and Fi-2. In such data-dependent cases there is no room for making things parallel, and the application obviously won't speed up as you throw more processors at the problem.
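As a quick sketch (Python, purely for illustration), the iterative computation makes the dependency chain explicit:

```python
def fib(n):
    """Return the nth Fibonacci number iteratively."""
    a, b = 0, 1  # F0 and F1
    for _ in range(n):
        # F(i) strictly depends on F(i-1) and F(i-2),
        # so these iterations cannot run in parallel.
        a, b = b, a + b
    return a
```

No iteration of the loop can start before the previous one finishes, which is exactly the data dependency that rules out parallel execution.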

Fn+2 = Fn+1 + Fn

Embarrassingly Parallel

Consider an application that requires adding two large matrices. The addition of any two corresponding elements is totally independent of every other element, so there is no data dependency. The matrices can easily be split into sub-problems, handed to however many processors are available, and run in parallel. Such applications can achieve linear speedup, and in some cases, with clever caching techniques, even super-linear speedup.
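A minimal sketch of the row-by-row split using Python's thread pool (threads are used here only to illustrate the decomposition; for CPU-bound numeric work you would typically reach for processes or a library like NumPy):

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(rows):
    # Sum one pair of corresponding rows.
    row_a, row_b = rows
    return [x + y for x, y in zip(row_a, row_b)]

def matrix_add(a, b, workers=4):
    # Every element sum is independent, so the matrices can be
    # split row by row and the rows summed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(add_rows, zip(a, b)))
```

Because no row's result depends on any other row, the scheduler is free to run all the sub-problems at once.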

In the two boundary cases discussed above we can predict how an application will speed up, but most applications we encounter lie somewhere between strictly sequential and embarrassingly parallel. Most applications have a certain degree of parallelism along with some data dependency that forces sequential behavior, and analyzing the speedup of such an application is much harder. In this post we look at a clever theorem derived independently by Graham [1] and Brent [2]; one of the MIT lectures I listened to consolidated their ideas into a very insightful result.

First, some naming conventions.

T_n - the time taken for the application to run when n processors are at its disposal. T_1 represents the sequential case, and T_infinity represents the case where the application may have as many processors as it needs. Based on these conventions we can define:

Parallelism = T_1/T_infinity
Speedup for p processors = T_1/T_p

An observation, assuming no cache tricks to achieve super-linear speedup:

T_p > T_1/p

This simply means that by having two processors you cannot cut the time exactly in half; it will always be a little more, due to possible data dependencies and, of course, the extra overhead associated with making the application parallel.
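Plugging in some made-up numbers (all timings below are hypothetical, just to exercise the definitions):

```python
# Hypothetical timings for one application, in seconds.
T1 = 100.0    # one processor (sequential time)
Tinf = 10.0   # unlimited processors (critical-path time)
Tp = 32.0     # measured time with p = 4 processors

parallelism = T1 / Tinf   # 10.0
speedup = T1 / Tp         # 3.125, short of the ideal 4x
assert Tp > T1 / 4        # the observation above: T_p > T_1/p
```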

Model

This theorem models a given application as a graph where nodes represent computational units and edges represent data dependencies, similar to the workflow idea. Graham and Brent argue that if you have infinitely many processors available, the running time depends on the critical path of the graph, the longest path from start to finish, so T_infinity may also be viewed as the critical-path time. The achievable speedup depends on this critical path as well as on the number of processors available to the scheduler when the application executes. Based on these two parameters they propose an upper bound on the time taken by an application when it is presented with p processors. An upper bound on time is useful because it allows you to provision resources for the worst-case scenario.

Theorem:

T_p <= T_1/p + T_infinity

The theorem says that an application with p processors takes at most the sequential time divided by p, plus the critical-path time. Many techniques have since been developed to tighten this upper bound, but in my view this is the most insightful theorem of its kind, and its simplicity makes it quite appealing when making a conservative estimate or during formal analysis.
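The bound is easy to evaluate; a small sketch with the same hypothetical numbers as before (100 s of total work, a 10 s critical path) shows the diminishing returns as p grows:

```python
def brent_upper_bound(t1, t_inf, p):
    # Graham/Brent bound: T_p <= T_1/p + T_infinity
    return t1 / p + t_inf

# Hypothetical workload: T_1 = 100 s, T_infinity = 10 s.
for p in (1, 2, 4, 8, 16):
    print(f"p={p:2d}  T_p <= {brent_upper_bound(100.0, 10.0, p):.1f} s")
```

Notice that no matter how many processors you add, the bound never drops below T_infinity, the critical-path time.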

[1] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.

[2] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.

Labels: graham and brent theorem, multicore, scheduling, Time complexity, upper bound