Wednesday, September 25, 2013

Hadoop on AWS - A Primer

This post distills a conversation I had with an architect at AWS; it may help if, like me, you know all the pieces but not how they fit together.
The premise: my customer wants to deploy his Hadoop solution in the cloud.

S3 is an object store holding over a trillion objects. It is more reliable than EBS, and less costly.
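To make the cost claim concrete, here is a back-of-the-envelope comparison. The per-GB prices are illustrative placeholders I picked for the example, not actual AWS pricing (and they also ignore EBS I/O and S3 request fees); check the AWS pricing pages for real numbers.

```python
# Illustrative storage cost comparison -- the rates below are assumed
# example numbers, NOT real AWS pricing.
S3_USD_PER_GB_MONTH = 0.095   # assumed example rate
EBS_USD_PER_GB_MONTH = 0.10   # assumed example rate (excludes I/O charges)

def monthly_storage_cost(gb: float, usd_per_gb: float) -> float:
    """Flat storage cost for one month, ignoring request/I/O fees."""
    return gb * usd_per_gb

data_gb = 1000  # a 1 TB data set
print(f"S3:  ${monthly_storage_cost(data_gb, S3_USD_PER_GB_MONTH):.2f}/month")
print(f"EBS: ${monthly_storage_cost(data_gb, EBS_USD_PER_GB_MONTH):.2f}/month")
```

The gap per GB looks small, but it compounds across a cluster's worth of volumes, and EBS adds I/O charges on top.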

You issue get() and put() requests to move data in and out of S3.
The AWS instance copies the data down to its local storage (the instance store).
If you instead provision EBS volumes to back your instances, it gets costly.
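The get/put round trip above can be sketched with the AWS CLI's `aws s3 cp` command; this helper just assembles the command lines (the bucket and path names are hypothetical examples):

```python
# Sketch: build the `aws s3 cp` commands for the get/put round trip.
# Bucket and path names are made-up examples.
from typing import List

def s3_get_cmd(bucket: str, key: str, local_path: str) -> List[str]:
    """Copy an object from S3 down to the instance's local storage."""
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", local_path]

def s3_put_cmd(local_path: str, bucket: str, key: str) -> List[str]:
    """Copy a local file back up to S3."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]

get = s3_get_cmd("my-input-bucket", "logs/day1.gz", "/mnt/data/day1.gz")
put = s3_put_cmd("/mnt/out/part-00000", "my-output-bucket", "results/part-00000")
```

You would run these (or the equivalent SDK calls) before and after each Hadoop job.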

A/ Use your own instances and deploy Hadoop yourself.
1/ Keep everything in S3. Ephemeral storage loses all its data once the machines are turned off, so you import data from S3 to local storage, process it, then export the results back to S3. No EBS.
2/ Use EBS volumes. Import data from S3 to EBS, process it, export the results back to S3, then shut down the EBS volumes.
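The two lifecycles differ only in the staging tier and the teardown step. A minimal sketch (the step labels are my own shorthand for the phases above, not AWS terminology):

```python
# Sketch of the two option-A lifecycles. Step names are informal labels.
def lifecycle(use_ebs: bool) -> list:
    staging = "EBS" if use_ebs else "instance store"
    steps = [
        f"import data S3 -> {staging}",
        "run Hadoop job",
        f"export results {staging} -> S3",
    ]
    # Ephemeral storage evaporates on shutdown; EBS volumes must be
    # shut down explicitly or they keep accruing charges.
    steps.append("shut down EBS volumes" if use_ebs
                 else "terminate instances (local data is gone)")
    return steps
```

Either way, S3 is the system of record; the compute-side storage is disposable.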

Moving data from S3 to EBS within the same region is free. If the EBS volume is in a different region, you must first copy the data S3-to-S3 across regions, then down to EBS.
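That routing rule can be written as a small decision function: same region means one direct copy, different regions means an S3-to-S3 hop first. Region names here are just examples.

```python
# Sketch of the copy-routing rule above. Region names are examples.
def copy_plan(s3_region: str, ebs_region: str) -> list:
    if s3_region == ebs_region:
        # Intra-region S3 -> EBS transfer is free.
        return [f"S3({s3_region}) -> EBS({ebs_region})"]
    return [
        f"S3({s3_region}) -> S3({ebs_region})",    # cross-region copy first
        f"S3({ebs_region}) -> EBS({ebs_region})",  # now an intra-region copy
    ]

print(copy_plan("us-east-1", "us-east-1"))
print(copy_plan("us-east-1", "eu-west-1"))
```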

With this approach you keep the ability to use the other Hadoop tools in the ecosystem.
B/ Use EMR.

Data gets moved from S3, processed, then exported back to S3.
Optionally, you can leave the machines running if the jobs are long-running.
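The "leave the machines on" option maps to a real flag in EMR's RunJobFlow API, `KeepJobFlowAliveWhenNoSteps`. This sketch only builds a request-shaped dict to show where that flag lives; the bucket names, instance sizes, and step arguments are made up, and the actual API call is omitted.

```python
# Sketch of an EMR job-flow request. KeepJobFlowAliveWhenNoSteps is a
# real RunJobFlow parameter; everything else here (names, sizes, step
# args) is a hypothetical example, and no API call is made.
def emr_request(input_s3: str, output_s3: str, long_running: bool) -> dict:
    return {
        "Name": "hadoop-on-emr-demo",
        "Instances": {
            "InstanceCount": 5,
            "MasterInstanceType": "m1.large",
            "SlaveInstanceType": "m1.large",
            # Keep the cluster alive between jobs instead of tearing it
            # down after each S3 -> process -> S3 cycle.
            "KeepJobFlowAliveWhenNoSteps": long_running,
        },
        "Steps": [{
            "Name": "wordcount",
            "HadoopJarStep": {"Args": ["-input", input_s3, "-output", output_s3]},
        }],
    }
```

With the flag set to False, EMR tears the cluster down after the last step, which is the cheap default for batch jobs.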

No way to use tools like Sqoop, Flume, etc.

