Tuesday, September 10, 2013

HDFS 2 - Hadoop 2.x

Notes on the next version of HDFS, captured at a talk/meeting at Hortonworks.

What high availability (HA) means in Hadoop 1.x vs 2.x

In 1.x, HA is implemented by:
- Linux HA
- Shared storage between NN instances.

In 2.x, HA no longer requires shared storage.
NN edits are journaled to disk on journal nodes, which can run on any host with a disk: the RM, the active NN, the standby NN, or even a DN (although that is not recommended).
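
To make the journaling idea concrete, here is a minimal Python sketch of a quorum write. It is a toy model, not Hadoop code: the names JournalNode, log_edit and tolerated_failures are made up for illustration, and it assumes an edit counts as durable once a strict majority of journal nodes has stored it.

```python
from dataclasses import dataclass, field


@dataclass
class JournalNode:
    """Hypothetical stand-in for a journal process with its own local disk."""
    name: str
    up: bool = True
    edits: list = field(default_factory=list)

    def append(self, txid, op):
        # A node that is down cannot acknowledge the write.
        if not self.up:
            return False
        self.edits.append((txid, op))
        return True


def log_edit(journal_nodes, txid, op):
    """Send the edit to every journal node; succeed on a strict majority of acks."""
    acks = sum(1 for jn in journal_nodes if jn.append(txid, op))
    return acks > len(journal_nodes) // 2


def tolerated_failures(n):
    """Number of journal nodes that can be down while a majority still exists."""
    return (n - 1) // 2


# Three journal nodes co-located with other daemons (assumed layout).
jns = [JournalNode("rm"), JournalNode("nn-active"), JournalNode("nn-standby")]
print(log_edit(jns, 1, "mkdir /a"))   # all three nodes are up
jns[0].up = False
print(log_edit(jns, 2, "mkdir /b"))   # two of three still form a majority
```

With three journal nodes the sketch survives one failure, and with five it survives two, which is why an odd number of nodes is the usual choice in this kind of scheme.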

New HDFS features:
- Write pipeline, append mode
- Ability to understand / take advantage of SSDs; exposed at the app level.
- Removed the 400 M namespace limit of Hadoop 1.x in the NN, via NN federation.
- Block management pool - will be moved to the DN in the next 2.x iteration.
- Snapshots. These will be stored in HDFS, in the same system.
- Short-circuit reads: going to the local disk directly for faster response.
- Use of NFS v4 - no gateway
- n + k fail-over.
- Use of Protocol buffers (also implemented in the next version of HBase). Will transparently replace the Writable interface for serialization.
- Stinger / Tez initiative.
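
As an illustration of the NN federation item above, here is a small Python sketch of how a client could route a path to one of several NameNodes through a mount table. It is a simplified model, not the Hadoop API: the resolve function, the MOUNT_TABLE contents and the host names are all hypothetical, and it assumes the longest matching mount point wins.

```python
def resolve(mount_table, path):
    """Return the NameNode whose mount point is the longest prefix of path."""
    best = None
    for mount_point, namenode in mount_table.items():
        prefix = mount_point.rstrip("/")
        # Match whole path components only, so /data does not cover /database.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, namenode)
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return best[1]


# Hypothetical split of one logical namespace across three NameNodes.
MOUNT_TABLE = {
    "/user": "nn1.example.com",
    "/data": "nn2.example.com",
    "/data/logs": "nn3.example.com",
}

print(resolve(MOUNT_TABLE, "/user/alice/report.txt"))
print(resolve(MOUNT_TABLE, "/data/logs/2013/09/10"))
```

Because each NameNode holds only its own slice of the namespace, the per-NN memory limit applies to a slice and not to the whole cluster.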

