Recently we had the opportunity to attend a talk by Hortonworks' own Arun Murthy about Hadoop 2.0 and, more specifically, YARN, the new resource management framework.
YARN (an acronym for Yet Another Resource Negotiator) offers a way to develop distributed applications, providing both cluster resource management and application life cycle management. It is essentially an OS for distributed applications of any kind!
Why was YARN developed?
Below are a handful of limitations of Hadoop 1.0:
- Hadoop 1.0 handles only Map Reduce jobs. As we saw in the September edition of Tech Cube J, Hadoop may not be the appropriate choice in certain cases, such as iterative computations. As an application developer, how do you handle programming paradigms other than Map Reduce, like real-time analysis?
- There is no High Availability for the Job Tracker. If the Job Tracker fails, there is no way to recover its previous state: all jobs must be rerun.
- Arbitrary limits are placed on the number of mapper and reducer slots assigned to each cluster node (set via the configuration parameters mapred.tasktracker.map/reduce.tasks.maximum), which may leave the cluster underutilized (i.e. some resources sit idle because of these fixed settings).
- Scalability currently tops out at about 4,000 nodes / 40,000 tasks.
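To make the fixed-slot limitation concrete, here is a sketch of what a Hadoop 1.0 mapred-site.xml might look like on a TaskTracker node. The property names are the real ones mentioned above; the values are purely illustrative:

```xml
<!-- mapred-site.xml (Hadoop 1.0): fixed slot counts per TaskTracker node -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>   <!-- at most 8 concurrent map tasks on this node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>   <!-- at most 4 concurrent reduce tasks on this node -->
  </property>
</configuration>
```

With settings like these, a reduce-heavy workload leaves map slots idle (and vice versa), no matter how much memory or CPU the node actually has free.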
The new way
These limitations are lifted in Hadoop 2.0; in particular:
- In YARN, you can run any application, not just Hadoop applications; running Storm is a typical example. Applications written in languages other than Java can also be run. All of these applications can take advantage of a common management model, with built-in security and other resource management features. YARN can be used as a general-purpose distributed computing framework.
- Multiple versions of Map Reduce can run simultaneously, which allows rolling upgrades.
- Hadoop jobs will automatically be able to detect the failure/failover of services they depend on, and to pause, retry and recover, essentially eliminating the Job Tracker SPOF of Hadoop 1.0.
- Cluster utilization is improved: the application developer specifies the resources he wants via the Container, the basic unit of resource allocation (memory, CPU, etc.). The Container's role is essentially to virtualize the hardware. The Application Master requests resources on specific nodes, and you are ensured to get the best possible container matching the settings you asked for, with good data locality. YARN acts like a stock market: it manages the resource bids and asks being placed.
- Overall cluster scalability has improved thanks to the Resource Manager.
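By contrast, in Hadoop 2.0 each node advertises a single pool of resources, and the scheduler carves containers out of it on demand. A minimal yarn-site.xml sketch (these are real YARN property names; the values are illustrative and would be tuned per cluster):

```xml
<!-- yarn-site.xml (Hadoop 2.0): one resource pool per node instead of fixed slots -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>  <!-- total memory this node offers to containers -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>   <!-- smallest container the scheduler will grant -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>   <!-- largest single container request allowed -->
  </property>
</configuration>
```

Any mix of map tasks, reduce tasks, or non-MapReduce containers can then share the node's 16 GB, instead of being boxed into per-type slots.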
As of September 2013, YARN is running in production at Yahoo! on over 35,000 nodes.
For more details on YARN, see http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.