Productionalization of Hadoop : Data governance ~ Big data

Now that Hadoop has matured, it is natural to ask how it can grow from a proof-of-concept (POC) project into a platform that is fully integrated into the enterprise. In this post we will review what it takes to “productionalize” Hadoop and the typical tasks involved.

Implementing a new product in the enterprise usually means adhering to the rules your IT team has laid out, often for regulatory compliance. Setting aside the constraints specific to Hadoop, a non-exhaustive list of requirements looks like this:

- Error handling and failure recovery.
- Logging and monitoring.
- Data security (encryption, access authorization...).
- Deployment: Production vs. UAT environments, data folder structure, and deployment automation.
- Easy access to data for production support, for investigation purposes.
- Disaster recovery (DR).
- Master data management: policies around data frequencies, source availability.

- Concepts of Data Quality: enforcement through metadata driven rules, hierarchies/attributes.

- Testing and integration cycle: from unit testing to user acceptance testing.

- Multi-tenancy, with the assumption that the product of choice is shared across projects. Attention must be given to storage and processing capacity planning.

- Business process integration: policies around data frequencies, source availability.

- Lifecycle management: data retention, purge schedule, storage, archival.

- Metadata: data definition, catalog, lineage.

Now let’s review this list in the context of Hadoop. A number of these items are actually addressed by the underlying framework (regardless of the vendor), while others are handled by additional third-party components. Some concepts are newer than others. The table below maps each requirement to the tools that address it:

Feature	Tool
Error handling	Hadoop Core, Falcon
Logging and monitoring	Ambari, Cloudera Manager
Security	Sentry, Kerberos, Knox, Ranger, Dataguise
Automated deployment	Ambari, Cloudera Manager
Disaster recovery	Cloudera Backup and Disaster Recovery, Falcon
Master data management	Pentaho Kettle, Talend
Data quality	Talend
Testing and integration	Apache MRUnit, PigUnit
Multi-tenancy support	YARN, Hadoop schedulers
Data impact monitoring	Cloudera Navigator, Falcon, Ambari, Cloudera Manager
Infrastructure monitoring	Ganglia, Nagios, Ambari, Cloudera Manager
Life cycle management	Falcon, Cloudera Navigator
Metadata management	HCatalog, Cloudera Navigator (search, classification), Falcon (tag search)
Tagging and search	Cloudera Navigator, Falcon

As you can see, some components are largely common across vendors (like Security or Monitoring) and can be leveraged from the vendor’s UI offering (Cloudera Manager, Ambari). Some, like MDM and Data Quality, are offered by third-party vendors. However, many of these features are covered by a tool unique to one vendor (although sometimes offered as an open source product). In this post we will cover Apache Falcon as it relates to some unique aspects of productionalizing a Hadoop cluster; unsurprisingly, its concepts are somewhat similar to what Cloudera Navigator offers.

Why Falcon?

After a POC, data has landed, scripts in Hive and Pig have been written, data from disparate systems has been joined, and some aggregation is showing up in a BI tool. This work manifests itself as a set of data flows that together form a pipeline.

However, once this data pipeline needs to be put into Production, it generally has to follow data governance requirements. This usually takes the form of a set of Oozie workflows, plus the need to orchestrate other tools such as distcp or Sqoop.

Apache Falcon attempts to answer data governance requirements such as:

- Data impact analysis: what happens if some data feed gets moved? What if someone changes these files, who is going to be impacted by this?

- Monitoring: there is a need to monitor not just the infrastructure, but also the full data pipeline, as well as the ownership of it.

- Late data handling: data never arrives perfectly on time, due to the variety of sources; how should this be handled?

- Replication and retention of data: there are generally different policies for the different data sets, e.g. raw data vs. cleansed data vs. production data.

- Compliance for auditing

These requirements typically complicate an otherwise simple data pipeline. They cannot be translated into a one-size-fits-all template; a higher-level tool is needed to address them. (Of note, Cloudera Navigator handles most of these requirements except late-data handling.)
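To make two of these requirements concrete, here is a minimal Python sketch of a retention check and a late-arrival check. It is only an illustration of the ideas, not Falcon code: the function names, the 90-day retention limit, and the 4-hour cut-off are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def instances_to_purge(instance_times, retention, now):
    """Return the feed instances that are older than the retention limit."""
    cutoff = now - retention
    return [t for t in instance_times if t < cutoff]

def should_reprocess(nominal_time, arrival_time, late_cutoff):
    """Late data triggers a rerun only if it lands within the cut-off window."""
    return nominal_time < arrival_time <= nominal_time + late_cutoff

# Hypothetical raw-data feed with instances that are 1, 30 and 120 days old.
now = datetime(2014, 11, 17)
raw_instances = [now - timedelta(days=d) for d in (1, 30, 120)]

# Assumed policy: keep raw data for 90 days, accept late data for 4 hours.
print(instances_to_purge(raw_instances, timedelta(days=90), now))
print(should_reprocess(now, now + timedelta(hours=2), timedelta(hours=4)))
```

Even in this toy form, each data set needs its own limits, which is why a single hand-written template does not scale well.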

Falcon automatically generates the Oozie workflows needed to meet these requirements from higher-level XML definitions, which dramatically simplifies the process.
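The idea of deriving low-level jobs from a declarative definition can be sketched in a few lines of Python. This is a conceptual model only: `FeedSpec`, `generate_jobs`, and the job descriptions are hypothetical names invented for this example, and the real tool works from XML and emits Oozie workflows rather than strings.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeedSpec:
    """Hypothetical declarative description of a feed and its policies."""
    name: str
    frequency: str
    retention: Optional[str] = None     # e.g. "90 days"
    replicate_to: Optional[str] = None  # e.g. name of a DR cluster

def generate_jobs(feed: FeedSpec) -> List[str]:
    """Expand the declared policies into the jobs that would enforce them."""
    jobs = []
    if feed.retention:
        jobs.append(f"{feed.name}-retention: purge instances older than {feed.retention}")
    if feed.replicate_to:
        jobs.append(f"{feed.name}-replication: copy each {feed.frequency} instance to {feed.replicate_to}")
    return jobs

raw_logs = FeedSpec("raw_logs", "hourly", retention="90 days", replicate_to="dr-cluster")
for job in generate_jobs(raw_logs):
    print(job)
```

The user only states the policy; the jobs that enforce it are derived, so changing a retention period means editing one value rather than a workflow.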

Vocabulary

Falcon introduces three concepts:

- A Cluster, which represents the interfaces to the JobTracker (JT) and NameNode (NN).

- A Feed, which is the data itself.

- A Process, which consumes or produces feeds.

So essentially users define their entities via these concepts in XML, then combine them to create modular, reusable data pipelines.
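The relationship between the three concepts can be modeled roughly as follows. This is a Python sketch for illustration, not Falcon's XML schema: the class fields, host names, and paths are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Cluster:
    """Where things run: endpoints for the JobTracker and NameNode."""
    name: str
    job_tracker: str
    name_node: str

@dataclass(frozen=True)
class Feed:
    """A data set, identified here by a name and a storage path."""
    name: str
    path: str

@dataclass
class Process:
    """A unit of work that consumes input feeds and produces output feeds."""
    name: str
    cluster: Cluster
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)

# Hypothetical entities; hosts and paths are placeholders.
primary = Cluster("primary", "jt.example.com:8021", "hdfs://nn.example.com:8020")
raw = Feed("raw_events", "/data/raw/events")
clean = Feed("clean_events", "/data/clean/events")
report = Feed("daily_report", "/data/report/daily")

# Processes are chained through the feeds they share, which forms the pipeline.
cleanse = Process("cleanse", primary, inputs=[raw], outputs=[clean])
aggregate = Process("aggregate", primary, inputs=[clean], outputs=[report])

for p in (cleanse, aggregate):
    print(p.name, [f.name for f in p.inputs], "->", [f.name for f in p.outputs])
```

Because feeds are defined once and referenced by several processes, a feed such as `clean_events` can be reused by new processes without redefining it.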

The higher-order XML tags in Falcon currently cover out-of-the-box (OOTB) policies such as replication, retention, and late data handling. They are also extensible, for example to plug in encryption or general external transformations. Falcon also allows engines other than the OOTB ones, such as Spring Batch instead of Oozie, or Cascading instead of Hive/Pig.

Monitoring

Monitoring of the data pipeline can be done with a combination of Falcon and Ambari. Ambari’s UI lets you supervise not only the infrastructure, but also the Falcon alerts for pipeline start, end, or error, the pipeline run history, and scheduling.

Tracing: Lineage, Tagging, Search, Auditing

Tracing in Falcon lets you visualize how the different Falcon components of the pipeline (feeds, etc.) are linked together. This helps when you want to change a certain step and first need an overview of the surrounding steps. In Falcon this is called lineage information; it is represented visually and stored internally in a graph database. This answers the impact analysis requirement.
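Impact analysis over lineage boils down to a graph traversal: starting from the entity you want to change, collect everything downstream of it. Here is a minimal Python sketch of that idea, using a plain dictionary as a stand-in for the graph store; the entity names are hypothetical.

```python
from collections import deque

# Lineage edges: each entity maps to the entities that directly depend on it.
LINEAGE = {
    "raw_events": ["cleanse"],
    "cleanse": ["clean_events"],
    "clean_events": ["aggregate", "export"],
    "aggregate": ["daily_report"],
    "export": ["dw_extract"],
}

def downstream(graph, start):
    """Breadth-first walk returning every entity affected by a change to start."""
    seen, order = set(), []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

print(downstream(LINEAGE, "clean_events"))
# ['aggregate', 'export', 'daily_report', 'dw_extract']
```

Anyone about to move or alter `clean_events` can see from this result which processes and feeds would be affected.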

Tagging attaches key/value pairs to feeds and processes, e.g. Owner = X, Data source = DW. This is widely used to show ownership, business value, and the external source system or destination.

This also allows you to perform search, for example to retrieve all of the feeds that are tagged “secure” and are owned by X.
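A tag search of this kind is essentially a filter over key/value pairs. The Python sketch below shows the idea; the feed names, tag keys, and the `find_entities` helper are assumptions made for the example, not Falcon's search API.

```python
def find_entities(entities, **required_tags):
    """Return the names of entities whose tags match every required key/value."""
    return [
        name
        for name, tags in entities.items()
        if all(tags.get(key) == value for key, value in required_tags.items())
    ]

# Hypothetical feeds with their tags.
FEEDS = {
    "raw_events": {"owner": "X", "source": "DW", "classification": "secure"},
    "clean_events": {"owner": "X", "source": "DW"},
    "daily_report": {"owner": "Y", "classification": "secure"},
}

print(find_entities(FEEDS, owner="X", classification="secure"))
# ['raw_events']
```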

Auditing simply refers to logging all changes to the pipeline, i.e. the action taken along with the user who took it.

Falcon user flow

In a nutshell, the user flow for working with Falcon goes as follows:

1/ Define your pipeline (feeds, processes, etc.) and its flows in XML.

2/ Submit these definitions from the Falcon CLI. The Falcon server validates the specifications given.

3/ Launch and schedule; the Oozie workflows are generated.

4/ Manage the workflow: at this point the user can check, suspend, and resume the pipelines for updates.
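The flow above can be thought of as a small state machine, with each action also recorded for auditing. The Python sketch below models that idea only; the state names, action names, and the `Entity` class are assumptions for illustration and do not reflect Falcon's actual CLI or internal states.

```python
# Allowed transitions: (current state, action) -> next state.
TRANSITIONS = {
    ("defined", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
}

class Entity:
    """A pipeline entity that tracks its state and who changed it."""

    def __init__(self, name):
        self.name = name
        self.state = "defined"
        self.audit = []  # list of (user, action) tuples

    def apply(self, action, user):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} an entity that is {self.state}")
        self.state = TRANSITIONS[key]
        self.audit.append((user, action))

cleanse = Entity("cleanse")
for action in ("submit", "schedule", "suspend"):
    cleanse.apply(action, user="alice")

print(cleanse.state)  # suspended
print(cleanse.audit)
```

Rejecting invalid transitions up front (for example, resuming something that was never scheduled) mirrors the validation role the server plays when definitions are submitted.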

Architecture

The Falcon server is essentially a centralized orchestration framework. It stores the XML definitions and handles JMS notifications (via ActiveMQ), so that clients can subscribe to and receive notifications about the pipeline. Ambari manages all this.
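The notification mechanism follows the usual publish/subscribe pattern. As a rough illustration, here is an in-memory Python stand-in for a message broker; it is not JMS or ActiveMQ, and the topic names and message fields are hypothetical.

```python
from collections import defaultdict

class Broker:
    """Tiny in-memory stand-in for a message broker with named topics."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of this topic only.
        for callback in self._subscribers[topic]:
            callback(message)

received = []
broker = Broker()
broker.subscribe("pipeline.cleanse", received.append)

# The orchestrator would publish an event when a pipeline run completes.
broker.publish("pipeline.cleanse", {"entity": "cleanse", "status": "SUCCEEDED"})
broker.publish("pipeline.other", {"entity": "other", "status": "SUCCEEDED"})

print(received)
# [{'entity': 'cleanse', 'status': 'SUCCEEDED'}]
```

Decoupling the publisher from the subscribers is what lets monitoring tools react to pipeline events without the orchestrator knowing about them.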

Closing thoughts on Falcon

Apache Falcon (and on the Cloudera side, Navigator) aims to answer the data governance requirements mandated by data stewards. Falcon’s higher-level definition language for its entities (processes, feeds) is a step in the right direction. In practice, Falcon is still relatively immature at this point (Nov. 2014): lineage is limited (very limited UI, REST API hard to use), JMS notifications are limited to completed events only, the late-data handling API is too coarse, and there is no Sqoop support. I believe these limitations are only temporary.

Monday, November 17, 2014