Provisioning Hadoop machines
Recently a customer asked me what kind of machines to purchase for a Hadoop environment, and how to configure them. The answer can essentially be derived from a few simple calculations that I want to walk through and demonstrate here.
The number of machines, and the specs of those machines, depend on a few factors: the volume of data (obviously), the data retention policy (how much data you can afford to keep before throwing it away), the type of workload you have (data science/CPU-driven vs. a "vanilla", IO-bound use case), and also the data storage mechanism (data container/file format, and the type of compression used, if any). We have to make some assumptions from the start; otherwise there are just too many parameters to deal with. These assumptions drive the data node configuration.
The other types of machines (NameNode/JobTracker, in Hadoop 1) need different specs and are generally more straightforward to size. We'll only talk about data nodes in this post.
Capacity planning for your data
The number of machines to purchase will depend on the volume of data to store and analyze, which in turn drives the number of spinning disks per machine (usually a fixed number of hard drives per machine). The below applies mainly to Hadoop 1.x; we will talk about YARN later on.
Capacity planning usually flows from a top-down approach of understanding:
- How many nodes you need
- What's the capacity of each node, on the CPU side
- What's the capacity of each node, on the memory side.
Let's do a back-of-the-envelope calculation. This is usually the first estimate you need to make when assessing what you need and budgeting for the machines.
Data nodes
HDFS is usually configured to replicate data 3 ways, so you will need 3x the actual storage capacity for your data. In addition, you need to set aside some machine capacity for temporary storage used during computation (transient map outputs stay local to the machine and are not stored on HDFS; local storage for compression is also needed). A good rule of thumb is to keep the disks at 70% capacity. Then we also need to take the compression ratio into account.
Let’s take an example:
Say we have 70 TB of raw data to store on a yearly basis (i.e. a moving window of 1 year). After compression (say, Gzip with a 60% compression ratio) we get:
- 70 - (70 * 60%) = 28 TB,
- which we multiply by 3x for replication = 84 TB,
- but we only want to fill disks to 70%: 84 TB = x * 70%, thus x = 84 / 0.7 = 120 TB is the capacity we need to plan for, for data durability, given 70 TB of raw data (see the sketch below).
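As a quick sanity check, here is a minimal Python sketch of that back-of-the-envelope calculation. The function name and defaults (60% compression ratio, 3x replication, 70% disk-fill target) are simply the assumptions from the example above, not anything Hadoop-specific:

```python
def required_cluster_capacity_tb(raw_data_tb,
                                 compression_ratio=0.60,  # e.g. Gzip saving ~60%
                                 replication_factor=3,    # HDFS default replication
                                 max_disk_fill=0.70):     # keep disks at ~70% full
    """Rough total disk capacity needed to durably store raw_data_tb of raw data."""
    compressed = raw_data_tb * (1 - compression_ratio)   # 70 TB -> 28 TB
    replicated = compressed * replication_factor         # 28 TB -> 84 TB
    return replicated / max_disk_fill                    # 84 / 0.7 -> 120 TB

print(required_cluster_capacity_tb(70))  # ~120.0 TB
```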
Number of nodes
Here are the recommended specifications for DataNode/TaskTrackers in a
balanced Hadoop cluster from
Cloudera:
- 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration (no RAID, please!)
- multi-core CPUs (say, 12 cores), running at least 2-2.5GHz
So let's divide the capacity planning figure by a per-node disk configuration that makes sense: 120 TB / (12 x 1 TB disks per node) = 10 nodes.
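Continuing the same sketch, here is a hypothetical helper that turns that capacity figure into a node count (12 disks of 1 TB per node, matching the Cloudera specs above):

```python
import math

def data_nodes_needed(cluster_capacity_tb, disks_per_node=12, disk_size_tb=1):
    """Round up to the number of data nodes needed to hold the planned capacity."""
    per_node_capacity_tb = disks_per_node * disk_size_tb
    return math.ceil(cluster_capacity_tb / per_node_capacity_tb)

print(data_nodes_needed(120))  # -> 10 nodes of 12 x 1 TB disks each
```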
How about # of tasks for each node?
Number of tasks per node
First, let's figure out the # of tasks per node:
- Usually count 1 core per task. If the job is not too CPU-heavy, the number of tasks can be greater than the number of cores.
- Example: 12 cores, with jobs using ~75% of the CPU.
- We are starting with 12 cores per machine. Let's assign 14 task slots (slightly more than the number of cores is a good rule of thumb): maxMapTasks=8, maxReduceTasks=6 (see the config sketch after this list).
- Again, this changes in the context of YARN.
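For reference, here is a sketch of how those slot counts map onto mapred-site.xml in Hadoop 1.x; the values (8 map slots and 6 reduce slots per TaskTracker) are just the example figures above and should be tuned to your workload:

```xml
<!-- mapred-site.xml (Hadoop 1.x) - example slot counts from the discussion above -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>   <!-- map slots per TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>   <!-- reduce slots per TaskTracker -->
</property>
```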
Memory
Now let's figure out the memory we can assign to these tasks. By default, the TaskTracker and the DataNode each take up 1 GB of RAM.
- For each task, count mapred.child.java.opts (200 MB by default) of RAM.
- In addition, count 2 GB for the OS. So, say the machine has 24 GB of RAM:
- 24 - 2 (OS) - 2 (DataNode + TaskTracker) = 20 GB available for our 14 task slots, so we can assign roughly 1.4 GB to each task (14 * 1.4 ≈ 20 GB); see the config sketch after this list.
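And the matching per-task heap setting, again as a sketch using the illustrative ~1.4 GB figure from the example:

```xml
<!-- mapred-site.xml (Hadoop 1.x) - per-task JVM heap (the default is -Xmx200m) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1400m</value>  <!-- ~1.4 GB per task slot on a 24 GB node -->
</property>
```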
YARN
In YARN, the arbitrary fixed limits on the number of map and reduce slots assigned to each cluster node disappear: the notion of fixed slots has been discarded, and resources are now configured in terms of amounts of memory (in megabytes) and CPU ("vcores"). Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU available on each node to both maps and reduces. If configuring these manually, simply set them to the amount of memory and number of cores on the machine after subtracting out resources needed for other services. See http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_mapreduce_to_yarn_migrate.html for more details.
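As a sketch, for the 24 GB / 12-core node used in the example above, a yarn-site.xml configuration along these lines would hand roughly 20 GB and the 12 cores to YARN containers; the exact values are illustrative and depend on what else runs on the node:

```xml
<!-- yarn-site.xml - illustrative values for a 24 GB, 12-core data node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>20480</value>  <!-- ~20 GB left after the OS and HDFS/YARN daemons -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>     <!-- one vcore per physical core; lower it if other services need CPU -->
</property>
```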