This derives from a conversation I had with an architect at AWS; it may help if, like me, you know all the pieces but not how they fit together.
The premise is that my customer wants to deploy their Hadoop solution in the cloud.
Storage:
S3 is an object store; it holds trillions of objects, is more durable than EBS, and costs less.
You issue get() and put() requests to move data in and out of S3.
The EC2 instance copies the data down to its local storage (instance store).
If you instead provision EBS volumes to back your instances, it's going to be costly.
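As a sketch, the get/put pattern with the AWS CLI might look like this (bucket name and paths are hypothetical):

```shell
# Pull an input file from S3 down to the instance's local (ephemeral) storage.
aws s3 cp s3://my-hadoop-bucket/input/data.csv /mnt/local-data/data.csv

# After processing, push results back to S3 so they survive instance shutdown.
aws s3 cp /mnt/local-data/results.csv s3://my-hadoop-bucket/output/results.csv
```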
Choices
A/ Use your own instances and deploy Hadoop yourself.
1/ Keep everything in S3. Ephemeral (instance-store) storage loses all its data once the machines are turned off, so you import data from S3 to local storage, process it, then export the results back to S3. No EBS involved.
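The import/process/export cycle above can be sketched with Hadoop's distcp and the s3a connector (bucket, paths, and job names are hypothetical):

```shell
# Import: copy the input dataset from S3 into HDFS on instance storage.
hadoop distcp s3a://my-hadoop-bucket/input hdfs:///data/input

# Process: run the MapReduce job against the local copy.
hadoop jar my-job.jar MyJobMain hdfs:///data/input hdfs:///data/output

# Export: persist results back to S3 before the cluster is shut down.
hadoop distcp hdfs:///data/output s3a://my-hadoop-bucket/output
```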
2/ Use EBS volumes. Import data from S3 to EBS, process it, export the results back to S3, then detach and delete the EBS volumes so you stop paying for them.
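A minimal sketch of the EBS variant with the AWS CLI (volume/instance IDs, sizes, device names, and bucket are all hypothetical):

```shell
# Create and attach an EBS volume in the instance's Availability Zone.
aws ec2 create-volume --availability-zone us-east-1a --size 500 --volume-type gp3
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf

# Format and mount it, then stage the input data from S3 onto it.
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /mnt/ebs-data
aws s3 cp --recursive s3://my-hadoop-bucket/input /mnt/ebs-data/input

# When the job finishes, export results and delete the volume.
aws s3 cp --recursive /mnt/ebs-data/output s3://my-hadoop-bucket/output
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 delete-volume --volume-id vol-0123456789abcdef0
```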
Moving data from S3 to EBS within the same region is free. If the EBS volume is in a different region, you must first copy the data S3-to-S3 across regions, then load it onto the EBS volume.
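The cross-region case might look like this (bucket names and regions are hypothetical):

```shell
# Step 1: copy the dataset S3-to-S3 into a bucket in the target region.
aws s3 cp --recursive s3://my-bucket-us-east-1/input s3://my-bucket-eu-west-1/input \
    --source-region us-east-1 --region eu-west-1

# Step 2: load it onto the EBS-backed instance in that region as usual.
aws s3 cp --recursive s3://my-bucket-eu-west-1/input /mnt/ebs-data/input
```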
You keep the ability to use other tools in the Hadoop ecosystem.
B/ Use EMR.
Data gets imported from S3, processed, then exported back to S3.
Optionally, you can leave the machines running if the jobs are long-running.
No way to use things like Sqoop, Flume, etc.
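For reference, a transient EMR cluster that runs this import/process/export cycle and then shuts itself down can be sketched like this (names, bucket, and jar paths are hypothetical):

```shell
# Launch a short-lived EMR cluster: it reads from S3, runs the job,
# writes results to S3, and terminates itself when the step completes.
aws emr create-cluster \
    --name "hadoop-batch" \
    --release-label emr-6.15.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --steps Type=CUSTOM_JAR,Jar=s3://my-hadoop-bucket/jobs/my-job.jar,Args=[s3://my-hadoop-bucket/input,s3://my-hadoop-bucket/output] \
    --auto-terminate
# Drop --auto-terminate to keep the cluster up for long-running jobs.
```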