Sunday, August 18, 2013

Cassandra or Hadoop on a Just A Bunch Of Disks setup

I have been told (and have put into practice) that Hadoop nodes don't need to be set up with RAID, because it is overkill given the replication of data already built into Hadoop. The same goes for Cassandra (and, I am assuming, for other NoSQL databases as well).
This only holds if, as your data grows, you keep adding brand new nodes to your cluster / ring. That means adding a whole new machine (VM or physical) with its own disks, RAM, Ethernet connection, etc. This is not always possible for a business. For example, a small startup I talked to recently had a data growth problem but didn't have the money to shell out for more instances, so they just added more SATA disks to their existing nodes.
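Well, this is not a "planned" situation in Hadoop. When they rebalanced the cluster, they found out that data was not being moved onto the new disks. As far as I can tell, the balancer evens out usage between nodes, not between the disks inside a single node.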
So instead they had to rebuild their cluster to stripe the disks on each node via RAID. This allowed them to add new disks whenever they needed to, without having to add new machines. Whenever a new disk needed to be added (or, for that matter, if a disk failed), the node could just be taken out of the cluster, and hot-swapping in a new disk was a matter of minutes.
So this seems to be an argument against using a JBOD configuration, at least in Hadoop.
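To get a feel for why the new disks stayed nearly empty, here is a small Python sketch of my own (it is not Hadoop code). It assumes new blocks on a node are handed out to its disks in round-robin order, which is my understanding of the DataNode's default behavior, and that nothing moves existing blocks between disks on the same node. The function name, the disk sizes, and the 128 MB block size are all made up for illustration.

```python
def place_blocks_round_robin(used_gb, capacity_gb, new_blocks, block_gb=0.125):
    """Simulate writing new_blocks blocks to one node's disks in round-robin order.

    used_gb and capacity_gb are per-disk lists (hypothetical numbers).
    Returns the per-disk usage in GB after the writes.
    """
    used = list(used_gb)
    disk = 0
    for _ in range(new_blocks):
        # Skip over disks that cannot hold another block.
        tried = 0
        while tried < len(used) and used[disk] + block_gb > capacity_gb[disk]:
            disk = (disk + 1) % len(used)
            tried += 1
        if tried == len(used):
            break  # every disk on the node is full
        used[disk] += block_gb
        disk = (disk + 1) % len(used)
    return used


# Two old 1 TB disks at 800 GB used, plus two brand new, empty 1 TB disks.
before = [800.0, 800.0, 0.0, 0.0]
capacity = [1000.0, 1000.0, 1000.0, 1000.0]

# Write another 200 GB (1600 blocks of 128 MB) to this node.
after = place_blocks_round_robin(before, capacity, new_blocks=1600)
print(after)  # [850.0, 850.0, 50.0, 50.0]
```

Under these assumptions every disk receives the same share of new writes, so the old disks keep filling up while the 800 GB gap to the new disks never closes. Striping the disks with RAID sidesteps this, because the node then sees a single volume.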
In Cassandra 1.2 it seems like extra care has been taken to make use of a JBOD configuration by default, but I believe the problem may be the same if you just want to add extra disks to your nodes.
See this link that describes this exact problem on Cassandra:
http://www.datastax.com/support-forums/topic/cassandra-on-jbod