Wednesday, September 25, 2013

Hadoop on AWS - A Primer

These notes come from a conversation I had with an architect at AWS. They may help if, like me, you know all the pieces but not how they fit together.
The premise is that my customer wants to deploy his Hadoop solution in the cloud.

Storage:
S3 is an object store that holds well over a trillion objects. It is more durable than EBS and costs less per GB.

You read and write S3 data with GET and PUT requests.
The EC2 instance copies the data to its local storage (instance store).
If you instead provision EBS volumes to back your instances, it will cost more.
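
As a rough illustration of that GET/PUT pattern, here is a minimal Python sketch. It assumes the boto3 library for the real client; the helper names, bucket, keys, and the /mnt/data path are all hypothetical and only there for illustration.

```python
import os


def fetch_to_instance_store(s3_client, bucket, key, local_dir):
    """GET one S3 object into a local (instance store) directory."""
    local_path = os.path.join(local_dir, os.path.basename(key))
    # download_file issues the GET request(s) and writes the object to disk.
    s3_client.download_file(bucket, key, local_path)
    return local_path


def push_to_s3(s3_client, local_path, bucket, key_prefix):
    """PUT a local file back to S3 under the given key prefix."""
    key = key_prefix.rstrip("/") + "/" + os.path.basename(local_path)
    # upload_file issues the PUT request(s).
    s3_client.upload_file(local_path, bucket, key)
    return key


# Hypothetical usage, assuming boto3 is installed and credentials are set up:
#   import boto3
#   s3 = boto3.client("s3")
#   path = fetch_to_instance_store(s3, "my-example-bucket", "input/part-00000", "/mnt/data")
#   push_to_s3(s3, path, "my-example-bucket", "output")
```

Passing the client in as an argument keeps the helpers easy to try out with a stand-in object before pointing them at a real bucket.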

Choices:
A/ Use your own instances and deploy Hadoop yourself.
1/ Keep everything in S3. Ephemeral (instance store) storage loses all its data once the machines are turned off, so you import the data from S3 to local storage, process it, then export the results back to S3. No EBS.
2/ Get EBS volumes. Import the data from S3->EBS, process it, export the results to S3, then delete the EBS volumes.

Moving data from S3->EBS within the same region is free. If the EBS volume is in a different region, the data first has to move S3->S3 from one region to the other, then to EBS.

With this option you can also use the other tools in the Hadoop ecosystem.
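
The import, process, export cycle of option A can be sketched in Python as three commands run in order. This is only one way to do it: it assumes the `hadoop distcp` tool for the copies, and the function names, the `s3a://` URIs, and the HDFS paths are hypothetical placeholders, not something from the conversation.

```python
import subprocess


def distcp_command(src, dst):
    """Build the `hadoop distcp` invocation that copies src to dst."""
    return ["hadoop", "distcp", src, dst]


def plan_job(s3_input, s3_output, job_cmd,
             hdfs_in="hdfs:///input", hdfs_out="hdfs:///output"):
    """Return the three stages of the cycle as a list of commands."""
    return [
        distcp_command(s3_input, hdfs_in),    # 1. import from S3 to cluster storage
        list(job_cmd),                        # 2. process (whatever job you run)
        distcp_command(hdfs_out, s3_output),  # 3. export the results back to S3
    ]


def run_plan(plan):
    """Run each stage in order, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, check=True)


# Hypothetical usage on a node where the `hadoop` CLI is available:
#   run_plan(plan_job("s3a://my-example-bucket/in", "s3a://my-example-bucket/out",
#                     ["hadoop", "jar", "my-job.jar", "hdfs:///input", "hdfs:///output"]))
```

Whether cluster storage sits on instance store or on EBS volumes does not change the plan; only the provisioning and teardown around it differ.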
B/ Use EMR.

Data gets moved from S3, processed, then exported to S3 again.
Optionally, you can leave the machines on if the jobs are long-running.
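
For what it is worth, here is a minimal Python sketch of how such an EMR cluster request might be put together for boto3's `run_job_flow` call. The function name, release label, instance types, and role names are assumptions on my part; adjust them to what your account actually has.

```python
def emr_request(name, log_uri, step_args, keep_alive=False, instance_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available to you
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # False: the cluster terminates once its steps finish.
            # True: the machines stay on for further (e.g. long-running) work.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "Steps": [{
            "Name": "process",
            "ActionOnFailure": "CANCEL_AND_WAIT" if keep_alive else "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(step_args)},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; yours may differ
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical usage, assuming boto3 is installed and credentials are set up:
#   import boto3
#   boto3.client("emr").run_job_flow(**emr_request("demo", "s3://my-example-bucket/logs/", args))
```

The `keep_alive` flag is the switch between a cluster that shuts down after the job and one that is left on.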

You cannot use tools like Sqoop, Flume, etc.




