Now that Hadoop has matured, it is natural to ask how it can grow from a proof-of-concept (POC) project into a fully integrated part of the enterprise. In this post we will review what it takes to “productionalize” Hadoop and the typical tasks involved.
Implementing a new product in the enterprise usually means adhering to the rules your IT team has laid out, often for regulatory compliance. Leaving aside the constraints specific to Hadoop, a non-comprehensive list of needs looks like this:
- Error handling and failure recovery.
- Logging and monitoring.
- Data security (encryption, access authorization...).
- Deployment automation, with separate Production and UAT environments and data folder structures.
- Easy access to data for production support and investigation purposes.
- Disaster recovery (DR).
- Master data management.
- Data quality: enforcement through metadata-driven rules, hierarchies/attributes.
- A testing and integration cycle: from unit testing to user acceptance testing.
- Multi-tenancy, with the assumption that the product of choice is shared across projects. Attention must be given to storage and processing capacity planning.
- Business process integration: policies around data frequencies, source availability.
- Lifecycle management: data retention, purge schedule, storage, archival.
- Metadata: data definition, catalog, lineage.
Now let’s review this list in the context of Hadoop. A number of these items are addressed by the underlying framework (regardless of the vendor), others are handled by additional third-party components, and some concepts are newer than others. The table below summarizes them:
| Feature | Tool |
|---|---|
| Error handling | Hadoop Core, Falcon |
| Logging and monitoring | Ambari, Cloudera Manager |
| Security | Sentry, Kerberos, Knox, Ranger, Dataguise |
| Automated deployment | Ambari, Cloudera Manager |
| Disaster recovery | Cloudera Backup and Disaster Recovery, Falcon |
| Master data management | Pentaho Kettle, Talend |
| Data quality | Talend |
| Testing and integration | Apache MRUnit, PigUnit |
| Multi-tenancy support | YARN, Hadoop schedulers |
| Data impact monitoring | Cloudera Navigator, Falcon, Ambari, Cloudera Manager |
| Infrastructure monitoring | Ganglia, Nagios, Ambari, Cloudera Manager |
| Lifecycle management | Falcon, Cloudera Navigator |
| Metadata management | HCatalog, Cloudera Navigator (search, classification), Falcon (tag search) |
| Tagging and search | Cloudera Navigator, Falcon |
As you can see, some components (like security or monitoring) are largely common across vendors and can be managed from the vendor’s UI offering (Cloudera Manager, Ambari). Others, like MDM and data quality, are offered by third-party vendors. However, a slew of these features are covered by a tool unique to each vendor (although sometimes offered as an open source product). In this post we will cover Apache Falcon as it relates to some unique aspects of productionalizing the Hadoop cluster; unsurprisingly, its concepts are somewhat similar to what Cloudera Navigator offers.
Why Falcon?
After a POC, data has landed, scripts have been written in Hive and Pig, data from disparate systems has been joined, and some aggregations are showing up in a BI tool. The pipeline of work manifests itself as a set of data flows.
However, once this data pipeline needs to be put into production, it generally has to follow data governance requirements. This usually materializes as a set of Oozie workflows, along with the need to orchestrate other tools such as distcp or Sqoop.
Apache Falcon attempts to answer data governance requirements such as:
- Data impact analysis: what happens if some data feed gets moved? If someone changes these files, who is impacted?
- Monitoring: there is a need to monitor not just the infrastructure, but also the full data pipeline, as well as its ownership.
- Late data handling: data never arrives perfectly on time, given the variety of sources; how do you deal with this?
- Replication and retention of data: there are generally different replication policies for different data sets, i.e. raw data vs. cleansed data vs. production data.
- Compliance for auditing.
These requirements complicate the simple data pipeline and cannot be translated into a one-size-fits-all template; a higher-level tool is needed to answer these questions. (Of note, Cloudera Navigator handles most of these requirements except late-data handling.)
Falcon automatically generates the Oozie workflows that answer these requirements, based on higher-level XML definitions. This dramatically simplifies the process.
Vocabulary
Falcon introduces the concepts of:
- A Cluster, which represents the interfaces to the JobTracker (JT) and NameNode (NN).
- A Feed, which is the data itself.
- A Process, which consumes or produces feeds.
So essentially the user defines his or her entities via these concepts in XML, then manipulates them to create modular, reusable data pipelines.
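To make this concrete, below is a minimal sketch of what a feed definition could look like. The feed name, cluster name, paths, dates, and ACL values are all hypothetical, and the exact elements should be checked against the Falcon entity schema for your version; the point is that frequency, retention, and late-arrival policies are declared right alongside the data location.

```xml
<!-- Hypothetical feed definition: names, paths, dates and ACL values are illustrative only -->
<feed name="rawClickstreamFeed" description="Hourly raw clickstream landing data"
      xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <!-- how long to wait for late-arriving data -->
    <late-arrival cut-off="hours(6)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-11-01T00:00Z" end="2016-11-01T00:00Z"/>
            <!-- purge instances older than 30 days -->
            <retention limit="days(30)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/raw/clickstream/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="etl-user" group="etl" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```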
The higher-order XML tags in Falcon currently cover out-of-the-box (OOTB) policies such as replication, retention, and late-data handling configuration. They are also extensible, for example to plug in encryption or other external transformations, and Falcon allows engines other than the OOTB ones, such as Spring Batch instead of Oozie, or Cascading instead of Hive/Pig.
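As a sketch of how these policies attach to a process, the fragment below declares a Pig-based workflow with a retry policy and a late-data handler for the hypothetical feed defined above. All names and paths are made up, and element details may vary by Falcon version.

```xml
<!-- Hypothetical process consuming the feed sketched earlier; names and paths are illustrative -->
<process name="cleanseClickstream" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-11-01T00:00Z" end="2016-11-01T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="rawInput" feed="rawClickstreamFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="cleanOutput" feed="cleanClickstreamFeed" instance="now(0,0)"/>
    </outputs>
    <!-- The workflow engine is pluggable: here a Pig script, but it could point to an Oozie workflow -->
    <workflow engine="pig" path="/apps/pipelines/cleanse.pig"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <!-- Late-data policy: rerun a dedicated workflow when data arrives after the feed's cut-off -->
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="rawInput" workflow-path="/apps/pipelines/late-handling"/>
    </late-process>
</process>
```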
Monitoring
Monitoring of the data pipeline can be done with a combination of Falcon and Ambari. Ambari’s UI lets you supervise the infrastructure, but it also surfaces the Falcon alerts for pipeline start/end or errors, pipeline run history, and scheduling.
Tracing: Lineage, Tagging, Search, Auditing
Tracing in Falcon allows you to visualize how the different Falcon components (feeds, processes, etc.) of a pipeline are linked together. This addresses the case where you want to make changes to a certain step and need a visual overview of the surrounding steps; in Falcon this is called Lineage information, represented visually and stored internally in a graph database. This answers the impact analysis requirement.
Tagging attaches key/value pairs to feeds and processes, e.g. Owner=X, Data source=DW. This is widely used to indicate ownership, business value, and the external source or destination system.
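In the entity definition, this is just a tags element, for example on the feed sketched earlier; the keys and values below are purely illustrative.

```xml
<!-- Hypothetical tags on a feed or process entity -->
<tags>owner=teamX,source=DW,classification=secure</tags>
```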
This also allows you to perform searches, for example to retrieve all of the feeds that are tagged “secure” and owned by X.
Auditing simply refers to logging all changes to the pipeline, i.e. the action taken along
with the user who took it.
Falcon user flow
In a nutshell, the user flow for working with Falcon goes:
1/ Define your pipeline: feeds, processes, etc. Write the pipelines and flows in XML.
2/ Submit these pipelines from the Falcon CLI. The Falcon server validates the specifications given.
3/ Launch and schedule; the Oozie workflows are generated.
4/ Manage the workflow: at this point the user can check, suspend, and resume the pipelines for updates.
Architecture
The Falcon server is essentially a centralized orchestration framework. It saves the XML definitions and handles JMS notifications (via ActiveMQ) so that clients can subscribe and be notified about the pipeline. Ambari manages all of this.
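For reference, these endpoints (HDFS, execution, Oozie, and the ActiveMQ messaging broker) are the ones declared in the cluster entity from the Vocabulary section. A minimal sketch follows, with hypothetical host names, ports, versions, and paths.

```xml
<!-- Hypothetical cluster definition: hosts, ports, versions and paths are illustrative only -->
<cluster name="primaryCluster" colo="datacenter1" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly"  endpoint="hftp://namenode-host:50070" version="2.4.0"/>
        <interface type="write"     endpoint="hdfs://namenode-host:8020" version="2.4.0"/>
        <interface type="execute"   endpoint="resourcemanager-host:8050" version="2.4.0"/>
        <interface type="workflow"  endpoint="http://oozie-host:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://activemq-host:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>
```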
Closing thoughts on Falcon
Apache Falcon (and, on the Cloudera side, Navigator) aims to answer the data governance requirements mandated by data stewards. The higher-level definition and language that Falcon provides for its entities (processes, feeds) is a step in the right direction. In practice, Falcon is still relatively immature at this point (November 2014): lineage is limited (very basic UI, a REST API that is hard to use), JMS notifications cover completed events only, the late-data handling API is too coarse, and there is no Sqoop support. But I believe these limitations are only temporary.