Tuesday, September 24, 2013

YARN: A primer

 

Recently we had the opportunity to attend a talk from Hortonworks’ own Arun Murthy about Hadoop 2.0 and more specifically YARN, the new Resource Management framework.
YARN (acronym for Yet Another Resource Negociator) offers a way to develop distributed applications, and provides cluster resource management, and application life cycle management. It is essentially an OS for distributed applications of any kind!


Why was YARN developed?

Below are a handful of limitations of Hadoop 1.0:
-       Hadoop 1.0 specifically handles only Map Reduce jobs. As we saw in the September edition of Tech Cube J, Hadoop may not be the appropriate choice in certain cases, like iterative computations.  As an application developer, how to handle other programming model paradigms other than Map Reduce, like real-time analysis?
-       There is no High Availability on the Job Tracker. If the Job Tracker fails, there is no way to recover from its previous state. All the jobs must be rerun.
-       Arbitrary limits are put in place on the number of mappers and reducer slots assigned on each cluster node (set via configuration parameters mapred.tasktracker.map/reduce.tasks.maximum), which may not make full use of the cluster’s resources (i.e. some resources may be idle because of these fixed settings).
-       4000 nodes/40,000 tasks is the current limit in terms of scalability.

The new way

These limitations are lifted in Hadoop 2.0; in particular:

-       In YARN, you can run any application, not just Hadoop applications; running Storm is a typical example. Applications written in languages other than Java can be run. All of these applications can take advantage of a common management model, with built-in security and other resource management features. YARN can be used as a general purpose distributed computing framework.
-       Multiple different versions of Map Reduce can be run simultaneously. This allows rolling upgrades.
-       Hadoop jobs will automatically have the ability to detect the failure/failover of services they are dependent on and have the ability to pause, retry and recover, essentially getting rid of the SPOF problem of the Job Tracker in Hadoop 1.0.
-       The cluster’s utilization is improved: the application developer chooses the resources that he wants to set via the Container, the basic unit of resource allocations (memory, CPU, etc).  The Container’s role is essentially about virtualization of hardware. The Application Master asks resources on specific nodes - you re ensured to get the best possible container with the settings you asked, with good data locality. Yarn acts like a stock market: it manages the resource bids/asks that are being set.
-       Overall cluster scalability has improved thanks to the Resource manager.


As of Sept, 2013, YARN is running in Production at Yahoo, over 35K nodes.


0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.