October 2013 ~ Big data - tidbits of knowledge

Friday, October 25, 2013

AWS Hadoop cost infrastructure comparison evaluation

A quick sample worksheet to calculate cost on AWS:

Inputs:

Data size (TB)                   1 and 5 (two scenarios)
Replication                      3
m1.xlarge disk space (GB)        1690
Processing space (%)             60%

                                 1 TB           5 TB
Number of instances              3.029585799    15.14792899
Rounded no. of instances         4              15
EC2 cost per hour ($)            0.64           0.64
EMR cost per hour ($)            0.12           0.12
Number of hrs (per day)          4              4

Total per day for compute ($)    12.16          45.6
S3 daily cost ($)                3              17
Total monthly cost ($)           467.2          1880

EBS cost ($)                     128            640
No. of Hadoop instances          4              6
Hadoop license cost ($)          337            339
EC2 cost monthly ($)             2188.8         3283.2
Total monthly cost ($)           2653.8         4262.2
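The arithmetic behind the worksheet can be sketched in a few lines of Python. This is only a reconstruction from the numbers shown, not the original spreadsheet: the function names are made up, and several inputs are inferred because they reproduce the totals (1 TB = 1024 GB, a 30-day month, an S3 rate of $0.10 per GB-month, and an always-on rate of $0.76 per instance-hour over 720 hours). Note that the worksheet uses 15 instances for the 5 TB case where rounding up would give 16, so the instance count is left as an input.

```python
GB_PER_TB = 1024  # inferred: the worksheet's instance counts imply 1 TB = 1024 GB


def instances_needed(data_tb, replication=3, disk_gb=1690, processing_space=0.60):
    """Instances whose usable disk space holds the replicated data set."""
    return data_tb * GB_PER_TB * replication / (disk_gb * processing_space)


def emr_monthly_cost(instances, data_tb, ec2_rate=0.64, emr_rate=0.12,
                     hours_per_day=4, days=30, s3_per_gb_month=0.10):
    """Cluster run a few hours a day on EMR, with the data kept in S3."""
    compute_per_day = instances * (ec2_rate + emr_rate) * hours_per_day
    s3_per_month = data_tb * GB_PER_TB * s3_per_gb_month  # assumed S3 rate
    return compute_per_day * days + s3_per_month


def always_on_monthly_cost(instances, ebs_cost, license_cost,
                           hourly_rate=0.76, hours_per_month=720):
    """Cluster running all month, plus EBS storage and a Hadoop license."""
    return ebs_cost + license_cost + instances * hourly_rate * hours_per_month


if __name__ == "__main__":
    scenarios = [(1, 4, 4, 128, 337), (5, 15, 6, 640, 339)]
    for tb, emr_nodes, hadoop_nodes, ebs, lic in scenarios:
        print(tb, "TB:",
              round(instances_needed(tb), 2),
              round(emr_monthly_cost(emr_nodes, tb), 2),
              round(always_on_monthly_cost(hadoop_nodes, ebs, lic), 2))
```

Changing hours_per_day or the per-hour rates shows quickly where the few-hours-a-day EMR setup stops being cheaper than a cluster that runs all month.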





Desc                       Cost
EC2 large                  0.32
Instances                  10
Hrs/month                  80
Total                      256
EMR extra cost rate        0.06     (Network price; 200Gig)
EMR cost                   48
Total EC2+EMR              304

EMR Large m/c instances    10
Hrs/month                  80
Cost per AWS calc          304
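The second worksheet is a plain rate x instances x hours product, computed once for the EC2 price and once for the EMR surcharge. A minimal sketch, with a made-up function name, that reproduces the 304 total:

```python
def on_demand_cost(rate_per_hour, instances, hours_per_month):
    """Monthly cost of a fleet billed per instance-hour."""
    return rate_per_hour * instances * hours_per_month


ec2_cost = on_demand_cost(0.32, 10, 80)   # EC2 large, worksheet total: 256
emr_cost = on_demand_cost(0.06, 10, 80)   # EMR extra cost, worksheet value: 48
total = ec2_cost + emr_cost               # worksheet: Total EC2+EMR = 304

print(f"EC2 {ec2_cost:.2f} + EMR {emr_cost:.2f} = {total:.2f}")
```

The network price noted in the worksheet is not part of the 304 figure, so it is left out here as well.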

Friday, October 18, 2013

A few things about the HDP Sandbox for Hadoop

The sandbox is really nice to work with.
That said, here are a few tidbits that helped me and that I want to share:

- Ambari, the UI, provides shell access, but sometimes you want to connect via ssh.

Don't do this:

$ ssh root@127.0.0.1:2222
ssh: Could not resolve hostname 127.0.0.1:2222: nodename nor servname provided, or not known

Do this instead (ssh takes the port through the -p option, not as a host:port suffix):

urbanlegends-2:~$ ssh -p 2222 root@127.0.0.1

Password should be 'hadoop'.

- If you want to use Hive and you are installing HDP from scratch, surprise: you cannot use Beeswax out of the box (as of this writing, October 2013), because it is not integrated yet.
So you will need to install Beeswax separately from Ambari.
The documentation is not complete, and you will need to download it yourself (via yum install beeswax).

- Adding a jar for a SerDe:
Even though you add the jar in the Hue UI File Browser, the jar location may not be picked up properly when using Hive at the command line, and Hue hides the actual path from you.
Workaround: add the jar as a resource in Beeswax and run your select statement from there. The log will then tell you where the jar was added.

E.g.: Added resource: /tmp/hue_3792@sandbox_201310151419_resources/hive-contrib-0.11.0.2.0.5.0-67.jar

- Installation of Hue:

Documentation:

http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_installing_manually_book/content/rpm-chap-hue-1.html

1. After the creation of the Hue user (step 3 in the documentation: “Create a Hue user and either deploy Hue in that user's home directory or under the /usr/share directory.”), the documentation omits to say that you actually need to download and install Hue.

That is, this step, mentioned in the HDP 1.3 documentation, was forgotten in the HDP 2.0 documentation:

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_installing_manually_book/content/rpm-chap-hue-3.html

Also this: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_installing_manually_book/content/rpm-chap-hue-4.html

2. After starting the daemon via /usr/lib/hue/build/env/bin/supervisor:

The IP address needs to remain 0.0.0.0 and the port needs to be a free port (check via netstat). The daemon should then say something like:

“Starting beeswax server on port <port>, talking back to Desktop at <host>”

and you can check the UI in the browser. “Desktop” refers to the Hue server (generally on the same management node as Ambari).
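If netstat is not at hand, the free-port check from step 2 can also be approximated by trying to bind the port. This is only a sketch: port_is_free is a made-up helper, and the port number in the usage lines is a placeholder, not a value taken from the Hue documentation.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" (or no permission for ports < 1024).
            return False
        return True


if __name__ == "__main__":
    candidate_port = 8000  # placeholder: use the port you plan to give Hue
    print(candidate_port, "is free" if port_is_free(candidate_port) else "is taken")
```

The check is conservative: a port that was closed only moments ago may still be reported as taken for a short while.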

A few notes:

- Installing g++: the package you actually need to install is gcc-c++,

i.e. yum install gcc-c++ .

- You can install multiple yum packages at once (in fact, all of the ones listed in the HDP doc) by putting all their names on the same yum install line.

But actually

http://gethue.tumblr.com/tagged/release

Hue Integration: as of HDP 2.0, Ambari and Hue are not integrated with each other, so their users need to be duplicated in each system. You can integrate both Hue and Ambari with LDAP (Active Directory); if that is done, enterprise users who have access to the Linux boxes will be able to use SSO in Ambari and Hue.

Hue Security: You need to ensure that all users created in Hue are able to create Hive jobs. If they cannot, it could be because the /user/<username> directories do not exist in HDFS. You have to create the user's directory in HDFS before that user can use Hue, as a .staging directory is needed for executing MapReduce jobs.
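Creating the missing /user/<username> directory usually comes down to two hadoop fs calls run as the HDFS superuser. The sketch below only builds and prints those commands by default. It assumes the superuser account is named hdfs and that sudo is available on the node; the helper names and the example user are made up.

```python
import subprocess


def hdfs_home_commands(username, superuser="hdfs"):
    """Commands that create /user/<username> in HDFS and hand it to that user."""
    home = f"/user/{username}"
    return [
        ["sudo", "-u", superuser, "hadoop", "fs", "-mkdir", home],
        ["sudo", "-u", superuser, "hadoop", "fs", "-chown", username, home],
    ]


def create_hdfs_home(username, dry_run=True):
    """Print the commands (dry run) or execute them on a cluster node."""
    for cmd in hdfs_home_commands(username):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if the command fails


if __name__ == "__main__":
    create_hdfs_home("alice")  # "alice" is a hypothetical Hue user
```

Running it with dry_run=False only makes sense on a node where the hadoop client is installed and configured.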

Beeswax settings: If there is a specific SerDe jar that every user needs every time, you can put it in /usr/lib/hive/lib and restart Hue. Hue will include that directory in the classpath when starting Beeswax. Check beeswax_server.out for more details.
