Wednesday, October 30, 2013

Friday, October 25, 2013

AWS Hadoop cost infrastructure comparison evaluation

Also, quick sample worksheet to calculate cost on AWS:

number of hrs   TB   replication   m1.xlarge disk space (GB)   processing space (%)
                1    3             1690                        60%

Number of instances             3.029585799   15.14792899
Rounded no. of instances        4             15
EC2 cost per hour ($)           0.64          0.64
EMR cost per hour ($)           0.12          0.12
Number of hrs                   4             4
Total per day for compute ($)   12.16         45.6
S3 daily cost ($)               3             17
Total monthly cost ($)          467.2         1880
EBS cost ($)                    128           640
No. of Hadoop instances         4             6
Hadoop license cost ($)         337           339
EC2 cost monthly ($)            2188.8        3283.2
Total monthly cost ($)          2653.8        4262.2
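
The arithmetic behind the worksheet can be sketched in a few lines of Python. This is a rough reconstruction, not the original spreadsheet: the 1024 GB per TB conversion, the 5 TB data size for the second column, the S3 rate of $0.10 per GB per month, the 30-day month and the combined 0.76 per hour rate for the always-on cluster are all assumptions, chosen only because they reproduce the numbers above. The rounded instance counts are passed in as given, since the worksheet rounds 3.03 up to 4 but 15.15 down to 15.

```python
GB_PER_TB = 1024  # assumption: binary terabytes, which matches the worksheet


def raw_instance_count(data_tb, replication, disk_gb, usable_fraction):
    """Nodes needed to hold the replicated data in the usable share of each disk."""
    return data_tb * GB_PER_TB * replication / (disk_gb * usable_fraction)


def emr_monthly(instances, ec2_rate, emr_rate, hours_per_day, data_tb,
                s3_rate_per_gb=0.10, days=30):
    """Transient EMR cluster run a few hours per day, data kept in S3.

    s3_rate_per_gb and days are assumed values, not taken from the worksheet.
    """
    compute_per_day = instances * (ec2_rate + emr_rate) * hours_per_day
    s3_per_month = data_tb * GB_PER_TB * s3_rate_per_gb
    return compute_per_day * days + s3_per_month


def always_on_monthly(instances, hourly_rate, ebs, license_cost, days=30):
    """Cluster running 24 hours a day, plus the EBS and Hadoop license lines."""
    return instances * hourly_rate * 24 * days + ebs + license_cost


# Column 1 (1 TB) and column 2 (assumed 5 TB), with the worksheet's rounded counts.
for data_tb, emr_nodes, hadoop_nodes, ebs, lic in [(1, 4, 4, 128, 337),
                                                   (5, 15, 6, 640, 339)]:
    print(data_tb, "TB")
    print("  raw instances :", round(raw_instance_count(data_tb, 3, 1690, 0.60), 4))
    print("  EMR + S3      :", round(emr_monthly(emr_nodes, 0.64, 0.12, 4, data_tb), 2))
    print("  always-on EC2 :", round(always_on_monthly(hadoop_nodes, 0.76, ebs, lic), 2))
```

Changing the hours per day or the node counts in the loop is a quick way to see where the transient EMR setup stops being cheaper than the always-on cluster.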
Desc                       Cost
EC2 large                  0.32
Instances                  10
Hrs/month                  80
Total                      256
EMR extra cost rate        0.06   (Network price; 200Gig)
EMR cost                   48
Total EC2+EMR              304
EMR Large m/c instances    10
Hrs/month                  80
Cost per AWS calc          304
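
The second worksheet is a plain rate x instances x hours product, and a minimal Python check looks like this. The function name is made up for illustration, and the rates are simply the ones listed above.

```python
def monthly_cost(rate_per_hour, instances, hours_per_month):
    """Cost of running `instances` machines for `hours_per_month` hours each."""
    return rate_per_hour * instances * hours_per_month


ec2_cost = monthly_cost(0.32, 10, 80)   # worksheet line "Total": 256
emr_cost = monthly_cost(0.06, 10, 80)   # worksheet line "EMR cost": 48
total = ec2_cost + emr_cost             # worksheet line "Total EC2+EMR": 304

print(round(ec2_cost, 2), round(emr_cost, 2), round(total, 2))
```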

Friday, October 18, 2013

A few things about the HDP Sandbox for Hadoop

The sandbox is really nice to work with.
That said, here are a few tidbits that helped me and that I want to share:

- There is shell access from Ambari (the UI), but sometimes you want to connect via ssh.

Don't do this:
$ ssh root@
ssh: Could not resolve hostname nodename nor servname provided, or not known

Do this instead:

urbanlegends-2:~$ ssh -p 2222 root@

Password should be 'hadoop'.

- If you want to use Hive and you are installing HDP from scratch: surprise, you cannot use Beeswax. As of this writing (October 2013), it is not integrated yet.
So you will need to install Beeswax separately from Ambari.
The documentation is incomplete, and you will need to download it yourself (via yum install beeswax).

- Adding a jar for a SerDe:
Even though you add the jar in the Hue UI File Browser, the jar location may not be picked up properly when you use Hive at the command line, and Hue hides the actual path from you.
Workaround: add the jar as a resource in Beeswax and run your select statement from there. The log will then tell you where the jar was added.
E.g.: Added resource: /tmp/hue_3792@sandbox_201310151419_resources/hive-contrib- 

- Installation of Hue:

1. After the creation of the Hue user, which the documentation describes as:
"3. Create a Hue user and either deploy Hue in that user's home directory or under the /usr/share directory."
the documentation omits to say that you actually need to download and install Hue.
That step was mentioned in the HDP 1.3 documentation but forgotten in the HDP 2.0 documentation.

2. After you start the daemon via /usr/lib/hue/build/env/bin/supervisor:
The IP address needs to remain as it is, and the port needs to be a free port (check via netstat). The daemon should then say something like:
Starting beeswax server on port <port>, talking back to Desktop at <host>
and you can check the UI in the browser. “Desktop” refers to the Hue server (generally the same management node as Ambari).
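
If you prefer a scripted check to reading netstat output, here is a minimal Python sketch that tests whether anything is already listening on a port. The helper name and the example port are hypothetical; use the port from your own Hue configuration. It only probes TCP on the given host, so treat it as a quick sanity check, not a replacement for netstat.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if no process accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return probe.connect_ex((host, port)) != 0


candidate_port = 8002  # hypothetical example value, not taken from the HDP docs
print(candidate_port, "is free" if port_is_free(candidate_port) else "is in use")
```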

A few notes:

- Installing g++: the package you actually need to install is gcc-c++,
i.e. yum install gcc-c++

- You can install multiple yum packages at once (in fact, all of the ones listed in the HDP doc) by putting all of their names on the same yum install line.

But actually

Hue Integration: as of HDP 2.0, Ambari and Hue are not integrated with each other, so their users need to be duplicated in each system. You can integrate both Hue and Ambari with LDAP (Active Directory); once that is done, enterprise users who have access to the Linux boxes will be able to have SSO in Ambari and Hue.

Hue Security: You need to ensure that all users created in Hue are able to create Hive jobs. If they cannot, it could be because there is no /user/<username> directory for them in HDFS. You have to create the user's directory in HDFS before that user can use Hue, because the .staging directory is needed to execute MapReduce jobs.
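
As a sketch of that directory setup, the Python helper below builds the two HDFS shell commands that create and hand over a user's home directory. It assumes the Hadoop 2 command line (hdfs dfs -mkdir -p and hdfs dfs -chown) and that you run it as an account with HDFS superuser rights; the function names and the example user are hypothetical. By default it only prints the commands so you can review them first.

```python
import subprocess


def hdfs_home_commands(username, group=None):
    """Build the HDFS shell commands that create /user/<username> and set its owner."""
    home = f"/user/{username}"
    owner = f"{username}:{group}" if group else username
    return [
        ["hdfs", "dfs", "-mkdir", "-p", home],
        ["hdfs", "dfs", "-chown", owner, home],
    ]


def create_hdfs_home(username, group=None, dry_run=True):
    """Print the commands (dry_run=True) or execute them, stopping on the first failure."""
    for cmd in hdfs_home_commands(username, group):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


create_hdfs_home("alice")  # "alice" is a made-up user name
```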

Beeswax settings: If there is a specific SerDe jar that every user needs every time, you can put it in /usr/lib/hive/lib and restart Hue. The directory will be included in the classpath when Beeswax starts. Check beeswax_server.out for more details.