Chathura Herath 's Blog: April 2010

Monday, April 19, 2010

Google Scholar, I want BibTex all the time !!!

Don't get me wrong, i am extremely grateful for Google scholar especially their import to BibTex feature and yes, i do want to stand on the shoulders of giants.

The BibTex feature is very useful when including related work for a paper but its not there by default and i say they should put it there by default. You do have the option though, to go to preferences and switch it on, but apparently its not associated with my Gmail account and if i login from a different machine i have to switch it on again.

Friday, April 09, 2010

Graham and Brent theorem for speedup when using multiprocessors or multicores

Since we are big into multicores these days and it would be useful to predict or make a conservative estimate the behavior of an application speedup as we throw more and more processors at it. Let us consider the two obvious extreme cases, one favorable the other unfavorable. The are embarrassingly parallel applications and strictly sequential applications respectively and we will have a look at an example in the reverse order.

Strictly Sequential.

Lets consider calculating the famous nth Fibonacci number and lets assume we do NOT know how to solve a second order recurrence using the clever mathematical way so we will just use the formula as it is. So starting from F0 and F1 we can calculate F2 and so fourth upto the nth number. So if we write a program to calculate Fi we can observe that it has a strict data dependency on Fi-1 and Fi-2. So in such data dependent cases there is no room for making things parallel. Such application obviously won't speed up as you throw more processors at the problem.

Fn+2 = Fn+1 + Fn

Embarrassingly parallel

Consider an application that requires adding two large matrices. Addition of two elements from the matrices are totally independent of any other element. So there is no data dependency in such a case and the matrices can be easily split into sub-problems and given to the number of processors that are available and can be made to run in parallel easily. In such applications programs could achieve linear speed up and in some cases with some clever caching techniques even super linear speedups can be achieved.

In the two cases discussed above we can calculate how an application may speedup in the boundary cases, but most applications that we encounter are in between strictly sequential and embarrassingly parallel. Most applications have certain degree of parallelism and some amount of data dependency which forces sequential behavior and analyzing speed up of such application is much harder. In this post we have a look at a clever theorem invented by Graham[1] and Brent[2] independently and one of the MIT talks i listened to consolidated their ideas to a very insightful theorem.

First Some naming conventions.

T_n - Time taken for the application to run when n processors are at its disposal. T_1 represent the sequential case and T_infinity represent that case where application may have as many processors as needed

based on above conventions we could define:

Parallelism = T_1/T_infinity
Speedup for p processors = T_1/T_p

An observation assuming no cache tricks to achieve super-linear speedup:

T_p > T_1/p

This simply mean that by having two processors you cannot cut the time exactly by half, it will always be a little more due to possible data dependencies and of course the extra overhead associated with making the application parallel.

Model

This theorem models a given application using a graph where nodes would represent computational units and edges would represent the data dependencies similar to the workflow idea. Graham and Brent argue that if you have infinite processors available for the application, the time will depend upon the critical path of the graph or longest path of the graph from start to finish. So T_infinity may also looked upon as the critical path time. They argue that the speedup that can be achieved depend on this critical path as well as the number of processors that is available to the scheduler at the execution of the application. Based on these two parameters they propose a upper bound to the time taken by an application when its presented with p processors. Upper bound of time is useful because it may allow you to provision your resources for worst case scenario.

Theorem:

T_p <= T_1/p + T_infinity

Theorem means an application with p processors at most would take time that is addition of sequential time / p and critical path time. Many techniques have since developed to reduce that upper bound but in my view this is the most insightful theorem of its nature and its simplicity makes it quite appealing when making an conservative estimate or during formal analysis.

[1]R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied
Mathematics, 17(2):416{429, March 1969
[2]Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of
the ACM, 21(2):201{206, April 1974

Labels: graham and brent theorem, multicore, scheduling, Time complexity, upper bound

Tuesday, April 06, 2010

Reducing debug cycle during Amazon elastic map reduce development

The cost model that Amazon has published for Amazon Elastic Map Reduce is totally unfair during the development process. The minimum billing unit is an hour and these hours add up quickly to run up your bill if you are not careful enough. If you are doing anything serious using Amazon Elastic Map Reduce, that is to say you are running something other than the word count example and you choose not to install Hadoop yourself but rather to develop off the Hadoop in Amazon Elastic Map Reduce, you will end up making lot of debug runs to get the configurations right. In each of these runs if the Hadoop gets launched even for a minute it will charge you for an entire hour times the number of instances you launched. Especially if you are using the Amazon Management Console you will end up having to start a new Job Flow every time you change your application and want to test it. These costs quickly add up if you are not careful, or rather careless.

Tips to reduce the costs

Avoid using the extra large machines

During development avoid using extra large instances because the cost of these are much much higher and because they have 8 cores you will be billed 8 normalized CPU hours when the instance gets launched.

Programatically Launch Job Flow with keep alive

I wrote a blog earlier showing how to launch a Job Flow programatically and in that i showed how to keep the Job Flow and the instances alive after your map reduce application finish. Then you can simply add a job flow step to the already running application. This will not only reduce the debug cycle because the instance boot up time is no longer relevant to the subsequent Job Flow Steps and you can launch multiple map reduce runs as Job Flow Step with in an hour and yet it will cost you only one hour of CPU because you are not shutting down the instances after one run.

Not develop on Amazon Elastic Map Reduce

One option is to install Hadoop locally and test it there before coming the Amazon so you will not end up paying an hours price for every few minutes of debug run you did.

Amazon should provide development instances billed per minute.

Best scenario is amazon either provide cheaper instances for development or bill per minute during development.

Labels: Amazon, cost, elastic map reduce, Hadoop, Map Reduce, minimize cost

Chathura Herath 's Blog

About Me

Monday, April 19, 2010

Google Scholar, I want BibTex all the time !!!

Friday, April 09, 2010

Graham and Brent theorem for speedup when using multiprocessors or multicores

Tuesday, April 06, 2010