Monday, November 25, 2013

Stinger and Tez: a primer


Recently I had the opportunity to attend a talk from Hortonworks’ own Alan Gates about Tez and Stinger. Here are the notes from this talk.

What is Stinger?

The Stinger initiative aims to redesign Hive to make it what people want today: Hive is currently used for large batch jobs and works great in that role, but people also want interactive queries, and Hive is too slow for that today. So a big driver is performance, with the goal of being 100x faster.
Additionally, Hive doesn't support all of the standard SQL features needed for analytics (e.g. date calculations, windowing functions); adding them will allow Hive to connect to a BI stack and serve SQL power users.
Stinger also covers work in HDFS and YARN. An example of this is the new data format, ORC. So Stinger is not focused only on Hive; it is a cross-project initiative.
This work is done in open source, led by Hortonworks but also involving SAP, Microsoft, and Facebook.

Why build upon Hive rather than build a new system?

Hive is out there, being used, and works at scale. It is hard to build a new tool that works at scale; it is easy to build a tool that is fast but only works on a small dataset (in the gigabyte range).
Hive already manages scale, and scale is hard. Now the effort is focused on making it faster. It is easier to do this than to build in scale afterwards.

Also, from a vision perspective, users only need one SQL tool, not a fragmented toolkit. Otherwise different tools are going to use different flavors of SQL and different data types, which creates an impedance mismatch and causes frustration for users. So the focus is to build just one tool that does both batch and interactive work, and have the optimizer make the right decision: is this a query that is going to run for an hour, so I should use batch-type operators, or is it going to take 5 seconds, so I should run it in an interactive way?

Why is SQL compatibility so important?

SQL is the English of the data world. You can get by pretty much anywhere with English when you are traveling, and it is the same way with SQL: it is the language that everyone knows, and there is a workforce of analysts and a set of BI tools that speak SQL. It just makes sense for Hadoop to speak SQL.
However, Hadoop should not speak only SQL; Hadoop is so much more than that. YARN brings different data processing models: real-time streaming models like Storm, a graph processing model like Giraph, an ETL model like Pig.
So Stinger is not focused only on SQL. Tez and Stinger open up to other tools like Pig and Cascading, making sure everyone shares the benefit.
Still, SQL is how most of the world is going to communicate with Hadoop.

What is Tez and how does it relate to the Stinger initiative?

In Hadoop 1.0, there was only one option for executing your data processing: MapReduce.


MapReduce is a great paradigm for a whole set of problems, but in the context of a relational system that strings together a set of relational operators, like Pig, Hive or Cascading, MapReduce is not the design you would choose if you were starting from a clean slate.

That is where Hadoop 2.0 comes in with YARN, which separates resource management from execution. Execution moves from the system into user space, where the user can write an application master that controls how the job is executed. That is the opportunity for Pig and Hive to write an application master tailored to them: Tez.

Tez is an application that runs on YARN and builds the underlying execution layer for a parallel data processing engine running relational operators like group-by, join and sort. Tez has evolved from MapReduce, with strengths like recoverability baked in, and it still scales and handles faults. The underlying execution layer can handle multiple inputs, which is needed for joins; MapReduce 1.0 only accepts one input, and additional code is needed to shoehorn joins into it. In Tez, joins are supported natively in the execution layer.

Data movement

Tez also frees you up in how you move data. In MapReduce you write data to disk in intermediate stages, persisting it in HDFS; this is great for large batch jobs, and you don't want to give that up, but for quick interactive jobs you want to be able to move data through memory or via sockets so you can stream it very quickly. The goal behind Tez is to enable both, where the application's optimizer (in Hive or Pig) makes the choice: should I use socket-based streaming because I know this is going to be really fast, or should I use HDFS-based communication because this job is going to take an hour, and if it blows up at the 59th minute I don't want to lose everything?

Startup times

Tez is also about driving down job startup times. MapReduce was built in an era when jobs took an hour on average, so a 30-second job startup was just noise; that is how it was designed. But for interactive jobs, it is important to drive startup times down from seconds into milliseconds.

Chip architecture optimizations

Stinger also includes rewriting Hive's internal operators to take advantage of modern chip architectures: traditional databases have operated on one record at a time, but you can now process multiple records at a time.

Buffer caching

The next Stinger phase will be about buffer caching for Hadoop: Hadoop needs to know what to keep in memory versus what should stay on disk. Hadoop was built around the notion that you scan everything on disk every time, and that is a great fit for batch jobs, but for some jobs and data sets you need to hold the data in memory in a cache, for example if you are doing frequent joins against dimension tables. Web-style buffer caching strategies will help in that respect.
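As a loose analogy, memoization shows the payoff of keeping a hot dimension table in memory. This is plain Python, not Hadoop internals; the table and key names are made up:

```python
# Toy analogy (not Hadoop code): cache lookups into a small, frequently
# joined "dimension table" so repeats hit memory instead of disk.
from functools import lru_cache

disk_reads = {"count": 0}

def read_from_disk(key):
    disk_reads["count"] += 1            # stands in for a slow disk/HDFS read
    return {"key": key, "label": f"dim-{key}"}

@lru_cache(maxsize=None)
def lookup(key):
    return read_from_disk(key)["label"]

# 1,000 probe-side rows, but only 10 distinct dimension keys:
for row in range(1000):
    lookup(row % 10)

print(disk_reads["count"])   # 10 -- every repeat was served from memory
```

The same intuition drives the buffer cache: when the working set (here, 10 keys) is far smaller than the stream scanned over it, keeping it resident eliminates almost all disk traffic.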

Cost-based optimizer

Hive needs a true cost-based optimizer, instead of the current rule-based optimizer.
For example, it almost always makes sense to execute a filter before a join. But today Hive is complex enough that there are many options in how it executes things, and you don't always know what the right answer is, so the optimizer needs to estimate the optimal plan for executing a particular statement.
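To make the filter-before-join intuition concrete, here is a toy Python sketch (nothing to do with Hive's actual operators; the tables are made up) counting how many row comparisons a naive nested-loop join performs in each order:

```python
# Toy illustration: filtering before a join shrinks the number of rows
# the join has to consider.

orders = [{"id": i, "cust": i % 100, "amount": i} for i in range(10_000)]
customers = [{"cust": c, "region": "EU" if c < 10 else "US"} for c in range(100)]

def join_then_filter():
    comparisons, out = 0, []
    for o in orders:
        for c in customers:
            comparisons += 1
            if o["cust"] == c["cust"] and c["region"] == "EU":
                out.append((o["id"], c["region"]))
    return out, comparisons

def filter_then_join():
    comparisons, out = 0, []
    eu = [c for c in customers if c["region"] == "EU"]   # filter first
    for o in orders:
        for c in eu:
            comparisons += 1
            if o["cust"] == c["cust"]:
                out.append((o["id"], c["region"]))
    return out, comparisons

a, n1 = join_then_filter()
b, n2 = filter_then_join()
assert sorted(a) == sorted(b)   # same answer either way
print(n1, n2)                   # far fewer comparisons with the filter first
```

Both plans return identical rows, but pushing the filter below the join cuts the work by 10x here; a cost-based optimizer makes this kind of choice from table statistics rather than fixed rules.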

What does Tez mean for Pig and other tools?

Tez is being built not just for Hive. Pig's APIs, for instance, are fairly different; different tools expose different APIs that are more appropriate for different use cases. But underneath, there are only so many things you can do to data: you can sort it, join it, group it, filter it, or project it. Having separate implementations of these operators is wasted effort. Tez optimizes the execution of these operators and makes them shareable across the stack to different tools.

Wednesday, November 20, 2013

How to install something on multiple machines at once with pssh (parallel ssh)

It is a common need to install and configure something on multiple machines at once. While there are a number of tools for this, a quick and dirty way to do it is with parallel ssh. I had a bit of a hard time setting this up the first time, so here are my notes on how to configure it correctly. This was tested on Ubuntu but may work on other platforms as well. First, install parallel-ssh:
sudo aptitude install pssh
Then you will need to grant passwordless authentication to the remote machines for pssh to work. First, generate an SSH key:
ssh-keygen -t rsa -P ""
This creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so the key can be used without your interaction (you don't want to enter a passphrase every time pssh connects to a node). Second, enable SSH access to your local machine with the newly created key:
cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys
Third, distribute the public key to the other machines:
ssh-copy-id -i $HOME/.ssh/ root@machine2
ssh-copy-id -i $HOME/.ssh/ root@machine3
And you are done. Note that the pssh command is named parallel-ssh. Let's give it a try:
root@tabitha7:~# parallel-ssh -h ./cassandrahosts -i ls
[1] 11:23:32 [SUCCESS] x.x.x.71
tomcat
[2] 11:23:32 [SUCCESS] x.x.x.74
jdk-6u45-linux-x64.bin
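For the curious, the fan-out pattern pssh implements is easy to sketch in Python. The `echo` below is a stand-in for a real `ssh` invocation so the sketch runs anywhere, and the host addresses are made up:

```python
# Toy illustration of what parallel-ssh does: run the same command
# against many targets concurrently. With real hosts you would swap
# the echo for something like ["ssh", f"root@{host}", "ls"].
import subprocess
from concurrent.futures import ThreadPoolExecutor

hosts = ["x.x.x.71", "x.x.x.74"]

def run_on(host):
    # One worker per host; each runs its command independently.
    out = subprocess.run(["echo", f"[SUCCESS] {host}"],
                         capture_output=True, text=True)
    return out.stdout.strip()

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_on, hosts))   # preserves host order

print(results)
```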

Wednesday, October 30, 2013

Friday, October 25, 2013

AWS Hadoop infrastructure cost comparison

Here is a quick sample worksheet to estimate Hadoop costs on AWS:

Assumptions: m1.xlarge instances with 1690 GB of disk each, of which 60% is usable processing space; HDFS replication factor of 3. Two scenarios: 1 TB and 5 TB of raw data.

                                   1 TB        5 TB
Number of instances needed         3.03        15.15
Rounded number of instances        4           15
EC2 cost per hour ($)              0.64        0.64
EMR cost per hour ($)              0.12        0.12
Number of hours per day            4           4
Total per day for compute ($)      12.16       45.60
S3 daily cost ($)                  3           17
Total monthly cost ($)             467.20      1,880
EBS cost ($)                       128         640
Number of Hadoop instances         4           6
Hadoop license cost ($)            337         339
EC2 cost monthly ($)               2,188.80    3,283.20
Total monthly cost ($)             2,653.80    4,262.20
Desc                         Cost
EC2 large ($/hr)             0.32
Instances                    10
Hrs/month                    80
EC2 total ($)                256
EMR extra cost rate ($/hr)   0.06   (network price; 200 GB)
EMR cost ($)                 48
Total EC2+EMR ($)            304
EMR large instances          10
Hrs/month                    80
Cost per AWS calculator ($)  304
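The worksheet's sizing arithmetic can be reproduced in a few lines of Python. This is my reading of the numbers above, not an official AWS calculator: data is tripled by HDFS replication, and only 60% of each m1.xlarge's 1690 GB of instance storage counts as usable processing space.

```python
# Sketch of the worksheet's sizing math (assumptions from the table above).
DISK_GB = 1690          # m1.xlarge instance storage
USABLE = 0.60           # fraction usable for processing
REPLICATION = 3
EC2_HOURLY = 0.64       # m1.xlarge on-demand price assumed by the worksheet
EMR_HOURLY = 0.12       # EMR surcharge per instance-hour

def instances_needed(tb):
    """Instances required to hold `tb` terabytes after replication."""
    return tb * 1024 * REPLICATION / (DISK_GB * USABLE)

def daily_compute_cost(instances, hours_per_day):
    return instances * (EC2_HOURLY + EMR_HOURLY) * hours_per_day

print(round(instances_needed(1), 2))           # ~3.03 -> round up to 4
print(round(instances_needed(5), 2))           # ~15.15 -> round up to 15
print(round(daily_compute_cost(4, 4), 2))      # 12.16, matching the worksheet
print(round(daily_compute_cost(15, 4), 2))     # 45.6
```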

Friday, October 18, 2013

A few things about the HDP Sandbox for Hadoop

The sandbox is really nice to work with.
That said, here are a few tidbits that helped me and that I want to share:

- There is shell access from Ambari (the UI), but sometimes you want to access the box via ssh.

Don't do this:
$ ssh root@
ssh: Could not resolve hostname nodename nor servname provided, or not known

Do that instead:

urbanlegends-2:~$ ssh -p 2222 root@

Password should be 'hadoop'.

- If you want to use Hive and you are installing HDP from scratch, surprise: you cannot use Beeswax (as of this writing, October 2013); it is not integrated yet.
So you will need to install Beeswax separately from Ambari.
The documentation is incomplete, and you will need to download it yourself (via yum install beeswax).

- Adding a jar for a SerDe:
Even though you add the jar in the Hue UI File Browser, the jar location may not be picked up properly when using Hive at the command line, and Hue hides the actual path from you.
Workaround: run your SELECT statement from Beeswax, adding the jar as a resource in Beeswax. The log will then tell you where the jar was added, e.g.:
Added resource: /tmp/hue_3792@sandbox_201310151419_resources/hive-contrib-

- Installation of Hue:

1. Create a Hue user and either deploy Hue in that user's home directory or under the /usr/share directory. The documentation omits to say that you need to actually download and install Hue; this step, mentioned in the HDP 1.3 docs, was forgotten in HDP 2.0.

2. Run the daemon via /usr/lib/hue/build/env/bin/supervisor.
The IP address needs to remain as configured and the port needs to be a free port (check via netstat). The daemon should then say something like:
Starting beeswax server on port <port>, talking back to Desktop at <host>
and you can check the UI in the browser. "Desktop" refers to the Hue server (generally the same management node as Ambari).

A few notes:

- Installing g++: you actually need to install gcc-c++, i.e. yum install gcc-c++.

- You can install multiple yum packages at once (in fact, all of the ones listed in the HDP doc) by putting their names on the same yum install line.

Hue integration: as of HDP 2.0, Ambari and Hue are not integrated, so their users need to be duplicated in each system. You can integrate Hue and Ambari with LDAP (Active Directory); if that is done, enterprise users who have access to the Linux boxes will be able to have SSO in Ambari and Hue.

Hue security: you need to ensure that all users created in Hue have permission to create Hive jobs. If they don't, it may be because there is no /user/<username> directory in HDFS. You have to create the user in HDFS before you can use Hue, since a .staging directory is needed for executing MapReduce jobs.

Beeswax settings: if there is a specific SerDe jar that has to be used every time and by all users, you can put it in /usr/lib/hive/lib and restart Hue. That directory will be included in the classpath when Beeswax starts. Check beeswax_server.out for more details.

Wednesday, September 25, 2013

Hadoop on AWS - A Primer

This derives from a conversation I had with an architect at AWS; it may help if, like me, you know all the pieces but don't know how they fit together.
The premise is that my customer wants to deploy their Hadoop solution in the cloud.

S3 is an object store; it holds over a trillion objects. It is more reliable than EBS and less costly.

You issue get() and put() calls to move data in and out of S3.
The AWS instance brings the data to its local storage (instance store).
If you instead provision EBS volumes to back your instances, it is going to be costly.

A/ Use your own instances and deploy Hadoop yourself.
1/ Keep everything in S3. Ephemeral storage loses all the data after machines are turned off. Need to import data from S3 to local storage, process data, then export to S3. No EBS.
2/ Get EBS volumes. Import data from S3->EBS. Process data, then export data to S3, shut down EBS.

It is free to move data from S3 to EBS within a region. If the EBS volume is in a different region, you must first copy the data S3-to-S3 across regions, then move it to EBS.

Have the ability to use other Hadoop tools in the ecosystem.
B/ Use EMR.

Data gets moved from S3, processed, then exported back to S3.
Optionally, you could leave the machines on if the jobs are long-running.

No way to use things like Sqoop, Flume, etc.

Tuesday, September 24, 2013

YARN: A primer


Recently we had the opportunity to attend a talk from Hortonworks’ own Arun Murthy about Hadoop 2.0 and more specifically YARN, the new Resource Management framework.
YARN (an acronym for Yet Another Resource Negotiator) offers a way to develop distributed applications, providing cluster resource management and application life cycle management. It is essentially an OS for distributed applications of any kind!

Why was YARN developed?

Below are a handful of limitations of Hadoop 1.0:
-       Hadoop 1.0 specifically handles only MapReduce jobs. As we saw in the September edition of Tech Cube J, Hadoop may not be the appropriate choice in certain cases, like iterative computations. As an application developer, how do you handle programming paradigms other than MapReduce, like real-time analysis?
-       There is no High Availability on the Job Tracker. If the Job Tracker fails, there is no way to recover its previous state: all jobs must be rerun.
-       Arbitrary limits are placed on the number of mapper and reducer slots assigned to each cluster node (set via configuration parameters), which may not make full use of the cluster's resources (i.e. some resources may sit idle because of these fixed settings).
-       Scalability is currently limited to about 4,000 nodes / 40,000 tasks.

The new way

These limitations are lifted in Hadoop 2.0; in particular:

-       In YARN, you can run any application, not just Hadoop applications; running Storm is a typical example. Applications written in languages other than Java can be run. All of these applications can take advantage of a common management model, with built-in security and other resource management features. YARN can be used as a general purpose distributed computing framework.
-       Multiple different versions of Map Reduce can be run simultaneously. This allows rolling upgrades.
-       Hadoop jobs will automatically have the ability to detect the failure/failover of services they are dependent on and have the ability to pause, retry and recover, essentially getting rid of the SPOF problem of the Job Tracker in Hadoop 1.0.
-       The cluster's utilization is improved: the application developer chooses the resources to request via the Container, the basic unit of resource allocation (memory, CPU, etc.). The Container's role is essentially the virtualization of hardware. The Application Master asks for resources on specific nodes, and you are ensured to get the best possible container with the settings you asked for, with good data locality. YARN acts like a stock market: it manages the resource bids/asks being placed.
-       Overall cluster scalability has improved thanks to the Resource Manager.
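As a toy analogy for the container model (illustrative Python, not the YARN API; the node name and sizes are made up), a node grants resource "asks" only while it has capacity left:

```python
# Toy analogy (not YARN code): a node hands out containers
# (memory/CPU bundles) from its capacity, the way YARN's
# ResourceManager grants ApplicationMaster requests.

class Node:
    def __init__(self, name, memory_mb, vcores):
        self.name, self.memory_mb, self.vcores = name, memory_mb, vcores

    def try_allocate(self, memory_mb, vcores):
        """Grant a container if this node has capacity; else refuse."""
        if self.memory_mb >= memory_mb and self.vcores >= vcores:
            self.memory_mb -= memory_mb
            self.vcores -= vcores
            return {"node": self.name, "memory_mb": memory_mb, "vcores": vcores}
        return None

node = Node("worker-1", memory_mb=8192, vcores=8)
c1 = node.try_allocate(2048, 2)   # granted: capacity is available
c2 = node.try_allocate(8192, 8)   # refused: not enough capacity remains
assert c1 is not None and c2 is None
```

In real YARN the ask also carries locality preferences (specific hosts or racks), and the scheduler arbitrates competing asks across the whole cluster rather than one node at a time.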

As of September 2013, YARN is running in production at Yahoo on over 35,000 nodes.