Wednesday, September 25, 2013

Hadoop on AWS - A Primer

This derives from a conversation I have had with an architect at AWS; this may help if like me you know all the pieces, but don't know how they all fit together.
The premises of this is the fact that my customer wants to deploy his Hadoop solution in the cloud.

S3 is an object storage - keeps a trillion objects. More reliable than EBS and less costly.

You need to issue get() and set() to get the data from S3. 
The AWS instance brings the data to its local storage (instance store).
If you provision instead EBS volumes to back your instances, it's going to be costly.

A/ Use your own instances and deploy Hadoop yourself.
1/ Keep everything in S3. Ephemeral storage loses all the data after machines are turned off. Need to import data from S3 to local storage, process data, then export to S3. No EBS.
2/ Get EBS volumes. Import data from S3->EBS. Process data, then export data to S3, shut down EBS.

Free to move data from S3->EBS intra-regions. If EBS volume in different region, you must first move data from S3->S3 from region to region, then to EBS.

Have the ability to use other Hadoop tools in the ecosystem.
B/ Use EMR.

 Data gets moved from S3, processed, then exported to S3 again.
Optionally, you could leave the machines on, if the jobs were long running.

No way to use things like Sqoop, Flume, etc.

Tuesday, September 24, 2013

YARN: A primer


Recently we had the opportunity to attend a talk from Hortonworks’ own Arun Murthy about Hadoop 2.0 and more specifically YARN, the new Resource Management framework.
YARN (acronym for Yet Another Resource Negociator) offers a way to develop distributed applications, and provides cluster resource management, and application life cycle management. It is essentially an OS for distributed applications of any kind!

Why was YARN developed?

Below are a handful of limitations of Hadoop 1.0:
-       Hadoop 1.0 specifically handles only Map Reduce jobs. As we saw in the September edition of Tech Cube J, Hadoop may not be the appropriate choice in certain cases, like iterative computations.  As an application developer, how to handle other programming model paradigms other than Map Reduce, like real-time analysis?
-       There is no High Availability on the Job Tracker. If the Job Tracker fails, there is no way to recover from its previous state. All the jobs must be rerun.
-       Arbitrary limits are put in place on the number of mappers and reducer slots assigned on each cluster node (set via configuration parameters, which may not make full use of the cluster’s resources (i.e. some resources may be idle because of these fixed settings).
-       4000 nodes/40,000 tasks is the current limit in terms of scalability.

The new way

These limitations are lifted in Hadoop 2.0; in particular:

-       In YARN, you can run any application, not just Hadoop applications; running Storm is a typical example. Applications written in languages other than Java can be run. All of these applications can take advantage of a common management model, with built-in security and other resource management features. YARN can be used as a general purpose distributed computing framework.
-       Multiple different versions of Map Reduce can be run simultaneously. This allows rolling upgrades.
-       Hadoop jobs will automatically have the ability to detect the failure/failover of services they are dependent on and have the ability to pause, retry and recover, essentially getting rid of the SPOF problem of the Job Tracker in Hadoop 1.0.
-       The cluster’s utilization is improved: the application developer chooses the resources that he wants to set via the Container, the basic unit of resource allocations (memory, CPU, etc).  The Container’s role is essentially about virtualization of hardware. The Application Master asks resources on specific nodes - you re ensured to get the best possible container with the settings you asked, with good data locality. Yarn acts like a stock market: it manages the resource bids/asks that are being set.
-       Overall cluster scalability has improved thanks to the Resource manager.

As of Sept, 2013, YARN is running in Production at Yahoo, over 35K nodes.

Tuesday, September 10, 2013

HDFS 2 - Hadoop 2.x

Different points captured about the next version of HDFS - talk/meeting at Hortonworks

What high availability (HA) means in Hadoop 1.x vs 2.x

In 1.x, HA is implemented by:
- Linux HA
- Shared storage between NN instances.

In 2.x for HA you do not need a shared storage any more.
Nodes are journaled on a disk - any disk: RM, NN active, NN stand by, even DN (although not recommended).

New HDFS features:
-Write pipeline, append mode
- Ability to understand / take advantage of SSD's ; exposed at the app level.
- Removed the 400 M naming space of Hadoop 1.x in the NN, via the NN federation.
- Block management pool - will be moved to the DN in the next 2.x iteration.
- Snapshots. These will be stored in HDFS, in the same system.
- Short circuit reads : going to the local disk directly for faster response.
- Use of NFS v4 - no gateway
- n + k fail-over.
- Use of Protocol buffers (also implemented in next version of HBase). Will replace transparently Writable interface for serialization.
- Stinger / Tez initiative.

Monday, September 9, 2013

Hadoop Yarn meeting notes

Here are my notes, captured from a talk with Arun Murphy @ Hortonworks about Yarn.

State and limitations of Hadoop 1.x
In Hadoop 1.x, jobs do their own scheduling and life cycle management.
At Yahoo, 4000 nodes has been about the maximum # of nodes before performance seriously starts to go down; after this, synchronization gets coarse. Also, maximum # of tasks run simultaneously is 40,000.
One problem is also that failure kills all running jobs.
Hard partitionning of Maps/reduce slots: this setup is artificial, and results in low utilization (sometimes all Mappers are running, Reducers are empty).
Iterative apps in M/R are typically slow - wrong tool for the job ..

Hadoop 2.x
Yarn: a generic management system.
For cluster resource management, on top of Hdfs2 .
For example: you can have a M/R application that gets 10% of the resources, Storm jobs that get 20%, Spark apps that get 5%, etc ..
Interactive apps can take advantage of Tez.
All of the apps can take advantage of a common management model, with built-in security and other resource management features.

Yarn is currently in Production at Yahoo, on about 5000 nodes.

Also, you dont have to use HDFS at all.

What Yarn offers, is : Cluster resource management, and app life cycle management. It is essentially an OS for distributed applications!
You could also run different versions of Map Reduce simultaneously ..
Yarn becomes user land.

In Yarn, an application is a temporal job.
Apps run in containers. A container, again, could run python, storm, or map reduce.
You choose in your app/container the resources that you want to set. Nothing is fixed, like in AWS. You choose exactly what you want.
In some cases, applications can allocate dynamically more containers, and tear them off - that's the case with  Storm on Yarn for example.

There is also a concept of a queue, with ACL for users.

Node Manager is the new name for a Task tracker.
The app master controls the job. You put the code on HDFS.
Container is essentially about virtualization of hardware.
App Master asks resources on specific nodes or specific rack - you re ensured to get the best possible container with the settings you asked, with good data locality.
Yarn is like a stock market: there are bids/asks that are being set.

Hdfs could have been run as an app !!

App master could have multiple containers. Resource manager gives resources (is-parent-of) to App master, which is the parent of its multiple containers potentially.
Today , Yarn is compatible at the binary level with Hadoop 1.x. You can ship your code/jar in it with no changes.

Max_map_tasks configuration attributes are gone in Yarn.

Hoya: HBase on Yarn.
HBase is a user level app. Hoya can bring up HBase on the fly and torn down, dynamically (like an EMR for HBase).

Yarn was coded in Java, but uses native code, and supports both Linux and Windows.
Today the resources it supports are memory and CPU, but in the future GPU and other things could be set also.
Yarn could be used as a virtualization layer for OSes, and spin up VMs ..
Resource Manager has HA.