Here are my notes, captured from a talk with Arun Murphy @ Hortonworks about Yarn.
State and limitations of Hadoop 1.x
In Hadoop 1.x, jobs do their own scheduling and life cycle management.
At Yahoo, 4000 nodes has been about the maximum # of nodes before performance seriously starts to go down; after this, synchronization gets coarse. Also, maximum # of tasks run simultaneously is 40,000.
One problem is also that failure kills all running jobs.
Hard partitionning of Maps/reduce slots: this setup is artificial, and results in low utilization (sometimes all Mappers are running, Reducers are empty).
Iterative apps in M/R are typically slow - wrong tool for the job ..
Hadoop 2.x
Yarn: a generic management system.
For cluster resource management, on top of Hdfs2 .
For example: you can have a M/R application that gets 10% of the resources, Storm jobs that get 20%, Spark apps that get 5%, etc ..
Interactive apps can take advantage of Tez.
All of the apps can take advantage of a common management model, with built-in security and other resource management features.
Yarn is currently in Production at Yahoo, on about 5000 nodes.
Also, you dont have to use HDFS at all.
What Yarn offers, is : Cluster resource management, and app life cycle management. It is essentially an OS for distributed applications!
You could also run different versions of Map Reduce simultaneously ..
Yarn becomes user land.
In Yarn, an application is a temporal job.
Apps run in containers. A container, again, could run python, storm, or map reduce.
You choose in your app/container the resources that you want to set. Nothing is fixed, like in AWS. You choose exactly what you want.
In some cases, applications can allocate dynamically more containers, and tear them off - that's the case with Storm on Yarn for example.
There is also a concept of a queue, with ACL for users.
Node Manager is the new name for a Task tracker.
The app master controls the job. You put the code on HDFS.
Container is essentially about virtualization of hardware.
App Master asks resources on specific nodes or specific rack - you re ensured to get the best possible container with the settings you asked, with good data locality.
Yarn is like a stock market: there are bids/asks that are being set.
Hdfs could have been run as an app !!
App master could have multiple containers. Resource manager gives resources (is-parent-of) to App master, which is the parent of its multiple containers potentially.
Today , Yarn is compatible at the binary level with Hadoop 1.x. You can ship your code/jar in it with no changes.
Max_map_tasks configuration attributes are gone in Yarn.
Hoya: HBase on Yarn.
HBase is a user level app. Hoya can bring up HBase on the fly and torn down, dynamically (like an EMR for HBase).
Yarn was coded in Java, but uses native code, and supports both Linux and Windows.
Today the resources it supports are memory and CPU, but in the future GPU and other things could be set also.
Yarn could be used as a virtualization layer for OSes, and spin up VMs ..
Resource Manager has HA.
Thanks so very much for taking your time to create this very useful and informative site. I have learned a lot from your site. Thanks!!
ReplyDeleteHadoop Course in Chennai
Thanks for sharing. Excellent notes
ReplyDelete