Tuesday, April 06, 2010

Reducing the debug cycle during Amazon Elastic Map Reduce development

The cost model Amazon has published for Amazon Elastic Map Reduce is unfair to anyone who is still developing. The minimum billing unit is an hour, and those hours add up quickly if you are not careful. If you are doing anything serious with Amazon Elastic Map Reduce, that is, running something other than the word count example, and you choose to develop against the Hadoop that Amazon Elastic Map Reduce provides rather than installing Hadoop yourself, you will end up making a lot of debug runs just to get the configuration right. In each of these runs, even if Hadoop is up for only a minute, you are charged for an entire hour times the number of instances you launched. This is especially painful with the Amazon Management Console, where you have to start a new Job Flow every time you change your application and want to test it.
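To make the arithmetic concrete, here is a minimal sketch of whole-hour billing as described above. The function name and the example numbers (five 3-minute runs on 4 instances) are made up for illustration, and the sketch counts instance-hours only, not dollars.

```python
import math

def billed_instance_hours(run_minutes, instance_count):
    """Instance-hours charged for one Job Flow when every started hour is billed in full."""
    if run_minutes <= 0:
        return 0
    hours = math.ceil(run_minutes / 60)  # a 1-minute run still counts as 1 hour
    return hours * instance_count

# Hypothetical debug session: five separate 3-minute runs, each a new 4-instance Job Flow.
runs = [3, 3, 3, 3, 3]
total = sum(billed_instance_hours(m, 4) for m in runs)
print(total)  # 20 instance-hours billed
```

Those five runs use 15 minutes on each of 4 instances, which is one instance-hour of actual work, yet under this model they are billed as 20.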

Tips to reduce the costs

Avoid using the extra large machines

During development, avoid extra large instances: they cost much more, and because they have 8 cores you are billed 8 normalized CPU hours as soon as the instance is launched.

Programmatically launch the Job Flow with keep alive

I wrote an earlier post showing how to launch a Job Flow programmatically, and in it I showed how to keep the Job Flow and its instances alive after your map reduce application finishes. You can then simply add a Job Flow Step to the already running Job Flow. This shortens the debug cycle, because instance boot time no longer applies to the subsequent Job Flow Steps. It also cuts the cost: you can launch multiple map reduce runs as Job Flow Steps within an hour and still be billed for only that one hour, because you are not shutting the instances down after each run.
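A rough comparison of the two approaches, assuming the whole-hour billing described earlier. The function names, boot time, idle time and run lengths below are all hypothetical, chosen only to illustrate the effect.

```python
import math

def cost_separate_job_flows(run_minutes, instance_count, boot_minutes=0):
    """Instance-hours when every debug run starts (and boots) its own Job Flow."""
    return sum(
        math.ceil((m + boot_minutes) / 60) * instance_count
        for m in run_minutes
    )

def cost_keep_alive(run_minutes, instance_count, boot_minutes=0, idle_minutes=0):
    """Instance-hours when all runs are steps on one Job Flow that stays alive."""
    total_minutes = boot_minutes + sum(run_minutes) + idle_minutes
    return math.ceil(total_minutes / 60) * instance_count

# Hypothetical session: four short runs on 4 instances, 5 minutes to boot,
# and 20 minutes spent editing code between runs while the cluster idles.
runs = [4, 6, 5, 3]
separate = cost_separate_job_flows(runs, 4, boot_minutes=5)
kept_alive = cost_keep_alive(runs, 4, boot_minutes=5, idle_minutes=20)
print(separate, kept_alive)  # 16 4
```

The idle time is counted on purpose: a kept-alive Job Flow keeps billing while you are editing code, so the saving only holds if you terminate it when the debug session is over.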

Do not develop on Amazon Elastic Map Reduce

One option is to install Hadoop locally and test there before moving to Amazon, so you do not end up paying an hour's price for every few minutes of debug run.
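Even before a local Hadoop install, the map and reduce logic itself can be checked without any cluster. The following is a minimal sketch, assuming the job can be written as plain mapper and reducer functions. The names `mapper`, `reducer` and `run_local` are hypothetical and this is not Hadoop's API. It only imitates the map, sort and reduce phases in memory, using the word count example.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) for every word in a line of input."""
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def run_local(lines):
    """Imitate map -> sort by key -> reduce in memory, with no cluster involved."""
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

result = run_local(["Hadoop on Amazon", "hadoop locally"])
print(result)  # [('amazon', 1), ('hadoop', 2), ('locally', 1), ('on', 1)]
```

This catches logic errors for free, but not configuration problems, so a local Hadoop run is still worth doing before paying for a Job Flow.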

Amazon should provide development instances billed per minute

The best scenario would be for Amazon to either provide cheaper instances for development or bill per minute during development.
